Preface


Welcome to “The Art and Science of Transformer: A Breakthrough in the
Modern AI and NLP.” This book is designed for anyone eager to understand
the revolutionary transformer architecture that has significantly advanced 
the field of artificial intelligence. Whether you are a student, an aspiring 
data scientist, or a professional looking to expand your knowledge, this book 
aims to make the complex world of transformers accessible and 
understandable. 


The journey to creating this book began with my own experiences in the AI
field. Many people are daunted by the intricate mathematics and advanced 
concepts that underpin transformer models. However, we already know 
about the immense potential and transformative power of these models in 
natural language processing and beyond. This realization fueled my desire 
to demystify transformers and make them approachable for everyone, 
regardless of their technical background. 


In “The Art and Science of Transformer: A Breakthrough in the Modern AI 
and NLP” we break down the core components of transformer architecture 
into clear, digestible segments. We start with the basics, such as word 
embeddings and attention mechanisms, and progressively build up to more 
advanced topics like self-attention, positional encoding, and multihead 
attention. Each concept is explained with simple language, intuitive 
analogies, and visual aids to ensure that you not only understand how 
transformers work but also why they are so effective. By the end of this 
book, you will have a solid understanding of transformer architecture and be 
equipped with the skills to apply these models to solve various problems. 


I am grateful to the AI community for their groundbreaking research and
continuous innovation, which have laid the foundation for this book. I also
want to thank my colleagues, friends, and family for their support and
encouragement throughout this project.


My hope is that "The Art and Science of Transformer: A Breakthrough in the 
Modern AI and NLP" will not only educate but also inspire you. 
Transformers have the potential to unlock new possibilities and drive 
progress in numerous fields. By making this knowledge accessible, I aim to 
empower you to harness the power of transformers and contribute to the 
exciting future of artificial intelligence. 


Thank you for embarking on this journey with me. Let’s make transformers 
easy and unlock their potential together. 


Debstuti Das 
Introduction


In recent years, the field of artificial intelligence has undergone a paradigm 
shift with the advent of transformer architectures. Introduced in the 
seminal paper "Attention is All You Need" by Vaswani et al. in 2017, 
transformers have revolutionized natural language processing (NLP) and a
variety of other AI applications. Their ability to handle sequential data with 
unparalleled efficiency and accuracy has set new benchmarks in tasks like 
language translation, text generation, and even image processing. 


This book, "Transformers Theory Made Easy," aims to demystify the 
complex concepts behind transformer architectures. We will start by 
exploring foundational elements such as word embeddings and basic 
attention mechanisms. Gradually, we will delve into more advanced topics 
like self-attention, positional encoding, multi-head attention, and the entire 
transformer architecture. Through simple and detailed explanations and 
illustrative examples, we aim to demystify the complexities of this 
technology, offering insights into how Transformers achieve unprecedented 
performance in tasks such as translation, summarization, and beyond. 


Whether you are an AI researcher, a data scientist, or an enthusiast eager to 
understand the backbone of modern AI advancements, this book is your 
guide to mastering the sophisticated engineering behind Transformer
architecture in the easiest way possible. 


Prerequisites 


Before diving into the intricate structure and functioning of Transformer 
architecture, it is essential to establish a solid foundation in several key 
areas. A strong grasp of linear algebra, particularly matrix operations and 
vector spaces, is crucial for understanding the mathematical underpinnings 
of attention mechanisms. Familiarity with probability theory and statistics
will aid in comprehending the probabilistic models often used in machine 
learning. Additionally, a good understanding of neural networks, including 
concepts such as backpropagation, gradient descent, and various types of 
layers (e.g., convolutional and recurrent layers), will provide a necessary 
context for appreciating the innovations introduced by Transformers. By 
mastering these prerequisites, readers will be well-equipped to delve into the 
sophisticated world of Transformer architecture and fully appreciate its 
groundbreaking capabilities. 


What You Will Learn 


First we will discuss a few more essential prerequisites, including word
embedding techniques and the basics of Recurrent Neural Networks (RNNs).
From there, we delve into the core components of Transformers, starting
with an in-depth analysis of the attention mechanism, self-attention and the
role of multi-head attention. We explore positional encoding and its
significance in handling sequential data without recurrence. Detailed
chapters are dedicated to understanding each piece of the Transformer
architecture, explaining how they interact to process and generate data.


Let's dive into the fascinating world of transformer architecture and unlock 
its full potential together. 
Word Embedding


Overview 


Machines can understand only numbers. So we need some way to convert
our text inputs to numbers, or rather to a set of numbers (a vector). Word
embedding is a pivotal technique in natural language processing that 
transforms words into continuous vector representations, capturing their 
semantic relationships and meanings. 


Before jumping into the state-of-the-art word-embedding techniques, let’s
discuss some traditional methods that were used earlier and are still in use.
There are several ways to map a word or sentence to a vector, such as the
TF-IDF vectoriser, one-hot encoding etc. But these methods come with their
own limitations; above all, they do not capture the relationship between
words or the surrounding context information. Still, let’s start with these
simplest techniques and see how things improved.


Single number representation 


Let’s say our vocabulary is V with size |V| = n. The words in the vocabulary
are kept in some fixed order, usually ascending alphabetical order. So every
word gets one position associated with it, in the range from 1 to n, and that
single number represents the word.


TF-IDF vectorisation 


This is a sentence representation that works with the concepts of TF (term
frequency) and IDF (inverse document frequency). It is a statistical
measure used to evaluate the importance of a word in a document relative to
a collection of documents. Each word or token is referred to as a term here.
The score is calculated by multiplying the term frequency of the term in the
document by its inverse document frequency in the entire corpus. Let’s say
the entire corpus is D, which contains k documents $d_i$ for $i = 1, 2, \dots, k$.


Term-frequency TF(t, d) 


The term frequency of a term t in the document d measures the importance of
term t in the document d.


$$ TF(t, d) = \frac{\text{count of term } t \text{ in the document } d}{\text{count of all terms in the document } d} $$


Inverse-Document-Frequency IDF(t, D) 


Let’s say our corpus contains 10000 documents and 100 of them contain the
term t. Then the DF, or document frequency, of term t is 100. The more
documents contain the term, the more common it is. For example, the term
“the” will be present in almost all the documents, so it should not be given
that much importance. Hence we calculate the inverse document frequency.


$$ IDF(t, D) = \log\left(\frac{\text{total number of documents } (k) \text{ in the corpus}}{\text{number of documents containing the term } t}\right) $$


TF-IDF(t, d, D)


Now the TF-IDF is the product of term frequency and inverse document
frequency.


$$ \text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D) $$
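
The three formulas above can be sketched in a few lines of Python. This is only a minimal illustration, not a library implementation: the toy corpus is made up, the natural logarithm is assumed for the IDF (the text does not fix a base), and the helper names `tf`, `idf` and `tf_idf` are hypothetical.

```python
import math

def tf(term, doc):
    # count of the term in the document / count of all terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total number of documents / number of documents containing the term)
    # Assumes the term occurs in at least one document of the corpus.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Toy corpus (made up for illustration); each document is a list of tokens.
corpus = [
    "the cat eats fish".split(),
    "the dog eats meat".split(),
    "the fox jumped".split(),
]

common = tf_idf("the", corpus[0], corpus)  # "the" occurs in every document
rare = tf_idf("cat", corpus[0], corpus)    # "cat" occurs in one document only
print(common, rare)
```

A term that appears in every document gets an IDF of zero, so its TF-IDF score vanishes no matter how often it occurs, while a rarer term keeps a positive score.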
One-hot vector representation


Here we will represent every word as a |V|-dimensional vector, where
|V| = n is the size of the vocabulary. In the vocabulary every word has one
position associated with it, from 1 to n. In the |V|-sized vector we put 1 only
at that position; the rest are zeros. We denote the vector corresponding to
word w as $v_w$.


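If “aardvark” and “abacus” are the first and second words and “zyzzyva” is
the last word in the vocabulary, then their one-hot representations will look
like below.


$$ v_{aardvark} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad v_{abacus} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad v_{zyzzyva} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} $$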


Bag of Words 


There is a concept of Bag-of-words (BOW) representation of a document d.
Let’s say there are T words in the document d and we denote the one-hot
vector representation of word $w_i$ as $v_{w_i}$. The Bag-of-words
representation of the document d, let’s call it $v_d$, is the sum of all the
word vectors.


$$ v_d = \sum_{i=1}^{T} v_{w_i} $$


We can also take a weighted sum of the word vectors,


$$ v_d = \sum_{i=1}^{T} \alpha_i v_{w_i}, \quad \text{where } \alpha_i \ge 0 \;\; \forall i $$


Here the weights $\alpha_i$ are usually proportional to the frequency of the
words in the document d. Word order and context information are not
captured here.
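
As a rough sketch of the unweighted sum above, the snippet below builds one-hot vectors for a tiny, made-up vocabulary and adds them up. The function names `one_hot` and `bag_of_words` are hypothetical, and positions are 0-based here, whereas the text counts from 1.

```python
# Tiny illustrative vocabulary, kept in alphabetical order.
vocab = sorted(["cat", "dog", "eats", "fish"])
index = {word: i for i, word in enumerate(vocab)}  # 0-based positions

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(sentence):
    # Element-wise sum of the one-hot vectors of the words in the document.
    total = [0] * len(vocab)
    for word in sentence.split():
        total = [t + x for t, x in zip(total, one_hot(word))]
    return total

a = bag_of_words("cat eats fish")
b = bag_of_words("fish eats cat")
print(a, b, a == b)
```

Because the sum ignores positions, reordering the words of a document leaves its Bag-of-words vector unchanged.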


Problems with One-hot vector 


• One-hot vectors are high-dimensional, as the size of the vocabulary |V|
will be very large.


• These are sparse vectors with 1 in only one position and zeros everywhere
else. A document corpus containing m documents is represented by a very
sparse matrix of size |V| x m.


• Take the two sentences “cat eats fish” and “fish eats cat”: both have the
same Bag-of-words document representation using one-hot vectors. In
the positions of cat, eats and fish we will have 1, and the rest are zeros. But
these two sentences have completely different meanings.


$$ v_{\text{cat eats fish}} = v_{\text{fish eats cat}} $$


• Here each word is treated as independent of the rest of the words.
Consider “New York” as a phrase, which has a meaning of its own. But
here $v_{\text{New York}} = v_{\text{New}} + v_{\text{York}}$.

• There is no notion of similarity in these vectors. Word vectors of related
words are not close to each other. Similarity can be measured by the
dot product or cosine similarity. Each word vector has 1 in one place and
zeros elsewhere, so any two distinct word vectors are perpendicular to each
other and their dot product and cosine similarity will always be 0. Consider
apple and orange: both are fruits. So how do we represent this similarity?


$$ \text{Cosine\_similarity}(apple, orange) = \frac{v_{apple}^{T} v_{orange}}{\|v_{apple}\| \, \|v_{orange}\|} $$
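
The cosine similarity formula can be checked numerically. The sketch below assumes a hypothetical 4-word vocabulary in which apple and orange occupy the first two positions; `cosine_similarity` is an illustrative helper, not a library function.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors in a hypothetical 4-word vocabulary.
v_apple = [1, 0, 0, 0]
v_orange = [0, 1, 0, 0]

sim_different = cosine_similarity(v_apple, v_orange)
sim_same = cosine_similarity(v_apple, v_apple)
print(sim_different, sim_same)
```

Any two distinct one-hot vectors share no non-zero position, so their similarity is zero regardless of how related the words are.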


Will bi-gram or tri-gram solve some of the problems? 


In addition to the single words in the vocabulary, we can take bigrams (two
words at a time), trigrams (three words at a time) or n-grams in general. In
the sentence “cat eats fish” we can take the bigrams “cat eats” and “eats
fish”, or the trigram “cat eats fish”. So the two sentences “cat eats fish” and
“fish eats cat” won’t be represented by the same vector, as their
bigrams/trigrams are different.


However, the vocabulary size increases a lot once bigrams and trigrams are
included, so this does not look like a good solution. Also, the relationships
among the words are still not captured.
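
A small sketch of n-gram extraction, using the two example sentences from the text. The helper name `ngrams` is hypothetical.

```python
def ngrams(sentence, n):
    # All runs of n consecutive words in the sentence.
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams_a = ngrams("cat eats fish", 2)
bigrams_b = ngrams("fish eats cat", 2)
trigrams_a = ngrams("cat eats fish", 3)
print(bigrams_a, bigrams_b, trigrams_a)
```

The two sentences share no bigram, so an n-gram vocabulary can tell them apart, at the cost of many more vocabulary entries.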


How will capturing word relations help?


Consider a word embedding which maps words to dense, low-dimensional 
spaces where similar words are positioned closer together. Here each 
element of the word vector denotes some feature of the word, like one 
feature can be if it is a pet animal, another can be how much the word refers 
to technology etc. However, these are just for our understanding purposes. 
Computers won’t understand these kinds of features. But computers should be
able to generate such feature vectors for the words. 


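$$ v_{cat} = \begin{bmatrix} -0.54 \\ 9.5 \\ 13 \\ -2.1 \end{bmatrix}, \quad v_{dog} = \begin{bmatrix} -0.65 \\ 11 \\ 1.1 \\ -3.9 \end{bmatrix}, \quad v_{mobile} = \begin{bmatrix} 5.8 \\ 0.6 \\ -9.2 \\ 4.2 \end{bmatrix} $$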


If we have such a representation, then vector arithmetic becomes meaningful.


$$ v_{king} - v_{man} + v_{woman} = v_{queen} $$


$$ v_{aunt} - v_{woman} + v_{man} = v_{uncle} $$


If we see the vectors here in the 2D space above, ‘Man’ and ‘Woman’ are 
close and similar to each other. Similarly, king and queen are close to each 
other. If we add ‘King - Man’ and ‘Woman’ we should get a vector close to 
‘Queen’. 


Queen = King - Man + Woman 


Looks cool, doesn’t it?

Consider a simple vector representation of this kind. We take the
attributes power, rich, technology and gender in a 4-dimensional space.

King and Queen both have power and are rich, hence they have higher
values for these attributes. Nothing here is related to technology, hence the
technology attribute is 0. Gender is denoted as -1 for male and 1 for female.


              King    Queen   Man     Woman
power         0.9     0.8     0.5     0.4
rich          0.89    0.85    0.45    0.38
technology    0       0       0       0
gender        -1      1       -1      1


These attributes are only for our understanding; a trained model learns its
own set of attributes, which need not be human-interpretable.


Now what we want is somewhat like below. King - Man + Woman should 
almost be equal to Queen. 


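$$ v_{king} - v_{man} + v_{woman} = \begin{bmatrix} 0.9 \\ 0.89 \\ 0 \\ -1 \end{bmatrix} - \begin{bmatrix} 0.5 \\ 0.45 \\ 0 \\ -1 \end{bmatrix} + \begin{bmatrix} 0.4 \\ 0.38 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.8 \\ 0.82 \\ 0 \\ 1 \end{bmatrix} \approx \begin{bmatrix} 0.8 \\ 0.85 \\ 0 \\ 1 \end{bmatrix} = v_{queen} $$

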
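
The analogy can be verified with plain arithmetic. The sketch below assumes the four attribute vectors are assigned to King, Queen, Man and Woman in that order, with attribute order (power, rich, technology, gender) and gender -1 for male as stated in the text; that assignment is an assumption made for illustration.

```python
import math

# Attribute order: (power, rich, technology, gender). Assignment of the
# vectors to the four words is an assumption for this illustration.
vectors = {
    "king":  [0.9, 0.89, 0.0, -1.0],
    "queen": [0.8, 0.85, 0.0, 1.0],
    "man":   [0.5, 0.45, 0.0, -1.0],
    "woman": [0.4, 0.38, 0.0, 1.0],
}

# King - Man + Woman, computed element by element.
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

def distance(u, v):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

nearest = min(vectors, key=lambda word: distance(vectors[word], result))
print(result, nearest)
```

The result does not match any stored vector exactly; the analogy holds in the sense that the closest known word vector is the expected one.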
Word2Vec 


Word2Vec is a family of model architectures and optimisations used to
learn word embeddings from large document sets. Word2vec was created,
patented, and published in 2013 by a team of researchers led by Tomas
Mikolov at Google.


There is one quote from J.R. Firth, a famous English linguist, 


“You shall know a word by the company it keeps”. 
- J.R. Firth (1957)


The main concept behind word2vec is that “similar words tend to occur together
and will have similar contexts”. Here the term ‘similar’ does not mean the two
words need to be synonyms; rather, it means they are related.


Let’s consider the example below, where the centre word ‘fox’ is marked with
square brackets and its context words for different window sizes are marked
with parentheses.


The quick brown [fox] jumped over the lazy dog. [window size = 0]
The quick (brown) [fox] (jumped) over the lazy dog. [window size = 1]
The (quick) (brown) [fox] (jumped) (over) the lazy dog. [window size = 2]


The number of words in the context on the left side is equal to the number of
words on the right side.
There are two main models here that we can train, and as a result of that
training we get our word embeddings.


- Skip-gram model: Predicts context words within a certain range before 
and after the current centre word in the same sentence. 


- Continuous bag-of-words model: Predicts the centre word based on 
surrounding context words. The context consists of a few words before 
and after the current (centre) word. 


One thing to mention here: predicting the context words or the centre word is
only a “fake” (auxiliary) problem. We do not care about the predictions
themselves; as a by-product of solving it we get our word embeddings from the
Embedding Layer (shown later).


Skip-gram Model:


The skip-gram model predicts the context words given a centre word. We are
considering a window size of 2 here. So for the word $w_t$ the context words
will be $w_{t-1}, w_{t-2}, w_{t+1}, w_{t+2}$.


In the picture below we see a one-hot vector layer, then an embedding layer
(whose weights are learned during model training) and a softmax layer for
each output position $t + j$.


How do we train this model? 


We generate input output examples for the model training. We can consider 
different centre words and their context words by sliding the context 
window. 

Considering the same sentence as above: 


“The quick brown fox jumped over the lazy dog. “ 


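[Figure: skip-gram architecture. The one-hot vector representation of a word feeds the embedding layer, followed by a softmax layer for each output position from $w_{t-2}$ to $w_{t+2}$.]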


For the centre word ‘fox’ the context words are quick, brown, jumped and over,
considering a window size of 2. So we can generate the (input, output)
examples below and train the model:


- (fox, quick)
- (fox, brown)
- (fox, jumped)
- (fox, over)
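
Sliding the context window over a sentence to produce such pairs can be sketched as follows. The function name `skipgram_pairs` and the simple lowercasing and punctuation handling are assumptions made for this illustration.

```python
def skipgram_pairs(sentence, window):
    # Lowercase, drop the full stop and split on whitespace.
    words = sentence.lower().replace(".", "").split()
    pairs = []
    for t, centre in enumerate(words):
        for j in range(-window, window + 1):
            # Skip the centre word itself and positions outside the sentence.
            if j != 0 and 0 <= t + j < len(words):
                pairs.append((centre, words[t + j]))
    return pairs

pairs = skipgram_pairs("The quick brown fox jumped over the lazy dog.", window=2)
fox_pairs = [p for p in pairs if p[0] == "fox"]
print(fox_pairs)
```

Words near the start or end of the sentence simply get fewer context words, since part of their window falls outside the sentence.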


First let’s see the skip-gram flow for the position $t + j$. What is the
probability that $w_{t+j} = w_i$, for a $w_i \in V$?

The one-hot vector representation is connected to the d-dimensional
embedding layer (h), which is in turn fully connected to the softmax layer.


[Figure: the one-hot vector representation of a word is fully connected to the d-dimensional embedding layer, which is fully connected to the softmax layer producing the output probability vector for position $t + j$.]


The probability of the word at $t + j$ being $w_i$ is determined using the
softmax function. The basic skip-gram formulation defines $h = v_{w_t}$.


$$ P(w_{t+j} = w_i \mid w_t) = \frac{\exp(h^{T} v_{w_i})}{\sum_{w \in V} \exp(h^{T} v_{w})} = \frac{\exp(v_{w_t}^{T} v_{w_i})}{\sum_{w \in V} \exp(v_{w_t}^{T} v_{w})} $$
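
The softmax over the vocabulary can be sketched numerically. The 2-dimensional embeddings below are invented for illustration, and `skipgram_probs` is a hypothetical helper; a real model would learn the vectors during training.

```python
import math

def softmax(scores):
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_probs(centre_vec, word_vecs):
    # scores[i] is the dot product between the centre vector and word i
    scores = [sum(a * b for a, b in zip(centre_vec, v)) for v in word_vecs]
    return softmax(scores)

# Hypothetical embeddings for a 3-word vocabulary.
embeddings = {"fox": [1.0, 0.0], "quick": [0.9, 0.1], "lazy": [-1.0, 0.5]}
words = list(embeddings)
probs = skipgram_probs(embeddings["fox"], [embeddings[w] for w in words])
print(dict(zip(words, probs)))
```

Note that the denominator touches every word in the vocabulary, which is harmless for three words but costly for a realistic vocabulary.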


What will be the loss function for this model training? 


The way we are using skip-gram, it is not a classification problem, and we 
can’t use cross-entropy loss. And also it is not truly a supervised learning 
problem as one word can have different context words. So we need our own 
loss function. [Feel free to skip the mathematical details used in this chapter.]
Consider an input sentence with words $w_1, w_2, \dots, w_T$.

Goal: for $-k \le j \le k$, $j \ne 0$: given the centre word $w_t$, predict the
context words $w_{t+j}$ within a window of fixed size k.


For a given input sentence, we want to maximise the likelihood of $w_{t+j}$ at
position $t + j$.


$$ \text{Likelihood } L = \prod_{t=1}^{T} \; \prod_{-k \le j \le k,\, j \ne 0} P(w_{t+j} \mid w_t) $$


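We have to maximise the likelihood, or equivalently minimise the negative
log-likelihood, and that will be our optimisation problem for the loss function.


Optimisation problem:


$$ \min \; -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-k \le j \le k,\, j \ne 0} \log P(w_{t+j} \mid w_t) $$


$$ \text{where } P(w_{t+j} \mid w_t) = \frac{\exp(v_{w_t}^{T} v_{w_{t+j}})}{\sum_{w \in V} \exp(v_{w_t}^{T} v_{w})} $$


$$ \text{and } \log P(w_{t+j} \mid w_t) = v_{w_t}^{T} v_{w_{t+j}} - \log \sum_{w \in V} \exp(v_{w_t}^{T} v_{w}) $$


This is the loss function we would like to use for our training problem.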


Drawbacks of skip-gram:


Computing $P(w_{t+j} \mid w_t)$ is expensive when the vocabulary is large.


In the term $\log \sum_{w \in V} \exp(v_{w_t}^{T} v_{w})$ we need to sum
$\exp(v_{w_t}^{T} v_{w})$ over every word w in the vocabulary, and that is
really expensive.

Also, we need to calculate the gradient $\frac{\partial P(w_{t+j} \mid w_t)}{\partial v_{w_t}}$
and backpropagate it for our learning, which is very expensive.


Can we convert this into a classification problem? 


Let’s try to construct a new training set that contains the true (centre
word, context word) pairs as positive examples, plus a few generated negative
samples. This approach is known as negative sampling.


Let $context(w_t)$ and $non\text{-}context(w_t)$ denote a context word and a
non-context word of the word $w_t$.


Positive examples: $(w_t, context(w_t), 1)$, where a centre word $w_t$ paired
with each of its context words is labelled as positive, or 1.


Negative examples: $(w_t, non\text{-}context(w_t), 0)$, where a centre word $w_t$
paired with a few of its non-context words is labelled as negative, or 0. The
negative examples should be independently and randomly sampled.


Now, again let’s do some math to understand the classification. (Skip this 
section if you want) 


The probability of an example belonging to class 1 (or positive) can be written
as,


$$ P(D = 1 \mid w_t, w_{t+j}) = \sigma(v_{w_t}^{T} v_{w_{t+j}}), \quad \text{where } \sigma(x) = \frac{1}{1 + \exp(-x)} $$


The probability of an example belonging to the negative class will be
$1 - P(D = 1 \mid w_t, w_{t+j})$.


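Now if we take K negative samples, we have K+1 independent samples,
one of them positive, corresponding to the centre word $w_t$.


$$ P(w_{t+j} \mid w_t) = P(D = 1 \mid w_t, w_{t+j}) \prod_{k=1}^{K} P(D = 0 \mid w_t, w_k) $$


$$ \log P(w_{t+j} \mid w_t) = \log \sigma(v_{w_t}^{T} v_{w_{t+j}}) + \sum_{k=1}^{K} \log\left(1 - \sigma(v_{w_t}^{T} v_{w_k})\right) $$


$$ = \log \sigma(v_{w_t}^{T} v_{w_{t+j}}) + \sum_{k=1}^{K} \log \sigma(-v_{w_t}^{T} v_{w_k}) $$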
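
The negative-sampling objective above only needs K+1 dot products per training pair instead of a sum over the whole vocabulary. The sketch below evaluates it for made-up 2-dimensional vectors; the function name `neg_sampling_log_prob` and all numbers are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_log_prob(centre, context, negatives):
    # log sigma(centre . context) + sum over negatives of log sigma(-centre . negative)
    log_p = math.log(sigmoid(dot(centre, context)))
    for neg in negatives:
        log_p += math.log(sigmoid(-dot(centre, neg)))
    return log_p

# Hypothetical vectors: the context vector points the same way as the centre
# vector, the negative samples point away from it.
centre = [1.0, 0.0]
context = [0.9, 0.1]
negatives = [[-1.0, 0.5], [-0.8, -0.2]]

good = neg_sampling_log_prob(centre, context, negatives)
# Swapping roles (a negative sample treated as the context word) scores lower.
bad = neg_sampling_log_prob(centre, negatives[0], [context, negatives[1]])
print(good, bad)
```

Training pushes this quantity up, which pulls true context vectors towards the centre vector and pushes sampled non-context vectors away.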


CBOW Model 


In contrast to the skip-gram model, the CBOW (continuous bag-of-words)
model tries to find the centre word from a few context words.


[Figure: CBOW architecture. The context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ are combined in the hidden layer h, which predicts the centre word $w_t$. The weight matrices are learned.]


The image above shows a window size of 2 with the context words
$w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$. We are trying to find the centre word $w_t$.


This architecture is called a bag-of-words model as the order of words in the 
context is not important. 


The working mechanism of CBOW can be mathematically represented as
follows:


• Let $v_{w_1}, v_{w_2}, \dots, v_{w_N}$ be the one-hot encoded vectors of the
context words $w_1, w_2, \dots, w_N$.

• Let $W_1$ be the weight matrix connecting the input layer to the hidden
layer, and $W_2$ be the weight matrix connecting the hidden layer to the
output layer.


• Let h be the hidden layer, which is the average of the input vectors
projected through $W_1$,


$$ h = \frac{1}{N} \sum_{i=1}^{N} W_1 v_{w_i} $$


• Let y be the output layer, which is the probability distribution over the
vocabulary, $y = \text{softmax}(W_2 h)$.


• The predicted word is the word with the highest probability in y,
and we want that word to be $w_t$ for a given input sentence.


The training process for CBOW is to minimize the difference between the 
predicted probability distribution y and the actual target word $w_t$, using a
loss function such as cross-entropy loss. This process is repeated for each 
sentence in the training dataset, which results in the learned embeddings. 
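
A forward pass of CBOW can be sketched in pure Python. The weights below are arbitrary numbers rather than trained values, the vocabulary has only four words, and `cbow_forward` is a hypothetical helper; the sketch assumes the hidden layer is the averaged one-hot input multiplied by the input-to-hidden matrix.

```python
import math

def matvec(M, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, W1, W2):
    vocab_size = len(W1[0])           # W1 has shape d x |V|
    avg = [0.0] * vocab_size          # average of the one-hot context vectors
    for i in context_ids:
        avg[i] += 1.0 / len(context_ids)
    h = matvec(W1, avg)               # d-dimensional hidden layer
    return softmax(matvec(W2, h))     # distribution over the vocabulary

# Arbitrary illustrative weights: |V| = 4 words, d = 2 hidden units.
W1 = [[0.1, 0.4, -0.2, 0.3],
      [0.0, -0.1, 0.5, 0.2]]
W2 = [[0.2, -0.3],
      [0.1, 0.4],
      [-0.5, 0.2],
      [0.3, 0.1]]

y = cbow_forward([0, 1, 3], W1, W2)
predicted = max(range(len(y)), key=y.__getitem__)
print(y, predicted)
```

Training would compare `y` with the one-hot vector of the true centre word using cross-entropy and update both weight matrices; the columns of the input-to-hidden matrix then serve as the word embeddings.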
Conclusion


In conclusion, word embedding is a fundamental technique in natural 
language processing that plays a crucial role in representing words as dense 
vectors in a continuous vector space. From its inception with basic methods 
like one-hot encoding and sequence-based methods, to the groundbreaking 
development of word2vec, word embedding has evolved significantly,
enabling more effective representation of semantic and syntactic 
relationships between words. The shift from sparse, high-dimensional 
representations to dense, low-dimensional vectors has revolutionized the 
field, paving the way for more sophisticated models and applications in 
machine learning and AI. As we continue to explore and innovate in the 
realm of word embedding, it is clear that this technique will remain a 
cornerstone in advancing our understanding in artificial intelligence. 
Recurrent Neural Network


Overview 


Recurrent Neural Networks (RNNs) have been a foundational architecture 
in the field of sequential data processing, particularly in natural language 
processing and time series analysis. Their ability to maintain a memory 
state, allowing information to persist throughout the sequence, has made 
them invaluable for tasks requiring an understanding of context and 
temporal dependencies. In this chapter we will delve into the high-level 
architecture of RNN. 


Motivation 


Consider a language model where we want to predict the word that comes
next.


The brown fox jumped over the ___ (road / fence / puddle)


If we want to solve it using a statistical method, then for a given sequence of
words $w_1, w_2, \dots, w_t$ we want to compute the probability distribution of the
next word $w_{t+1}$:


$$ P(w_{t+1} \mid w_1, w_2, \dots, w_t) $$
For a sequence of words $w_1, w_2, \dots, w_T$, the probability of the sequence
$P(w_{1:T})$ will be as shown below.


$$ P(w_{1:T}) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_T \mid w_1, w_2, \dots, w_{T-1}) $$


These probabilities can simply be count-based, where count(sequence) means
how many times the sequence is present in the whole corpus.


Consider the sentence,


“The quick brown fox jumped over the lazy dog”


$$ P(\text{fox} \mid \text{the quick brown}) = \frac{count(\text{the quick brown fox})}{count(\text{the quick brown})} $$
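
The count-based estimate can be sketched as below. The corpus is the example sentence plus one invented sentence, added only so that the history occurs more than once; `count` and `next_word_prob` are hypothetical helpers.

```python
def count(sequence, corpus):
    # Number of times the word sequence occurs contiguously in the corpus.
    n = len(sequence)
    return sum(1 for i in range(len(corpus) - n + 1)
               if corpus[i:i + n] == sequence)

def next_word_prob(word, history, corpus):
    # count(history followed by word) / count(history)
    return count(history + [word], corpus) / count(history, corpus)

# Example sentence from the text plus one made-up sentence.
corpus = ("the quick brown fox jumped over the lazy dog "
          "the quick brown cat slept").split()

p_fox = next_word_prob("fox", ["the", "quick", "brown"], corpus)
print(p_fox)
```

A history that never occurs in the corpus would make the denominator zero, which is one practical reason count-based models struggle with long contexts.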


But with the advent of Neural Networks, people wanted to utilise that for 
this type of problem, as Neural networks offer several advantages over 
traditional statistical approaches, particularly in tasks involving complex 
patterns and large datasets. 


Let’s first look at a feed-forward network for a trigram model, where we
compute the probability of $w_t$ given $w_{t-1}$ and $w_{t-2}$,
$P(w_t \mid w_{t-1}, w_{t-2})$.

We take the one-hot vector representations of the words and compute their
word embeddings in the Embedding Layer. These are fed into a feed-forward
neural network, which is then connected to a softmax layer to
compute the probability for $w_t$.
[Figure: one-hot vector representations are fully connected to the embedding layer, followed by a feed-forward neural network and a softmax layer that outputs the probability vector $P(w_t \mid w_{t-1}, w_{t-2})$.]


This works fine but has a few issues:


• We need to fix the number of previous words processed at a time; here we
fixed it to 2. A fixed length can be too small or too large.


• This limited context window can result in a loss of important long-range
dependencies.


• A fixed-size input window can be inefficient and less flexible for varying
sequence lengths.


• As the size of the n-gram increases, the number of parameters grows with
it (and the number of distinct n-grams grows exponentially), leading to
increased computational and memory requirements.


To resolve these problems, RNNs came into the picture.
RNN


RNNs are called recurrent because they perform the same task for every 
item of a Sequence, where the output depends on the previous 
computations. Another way to think about RNNs is that they have a 
“memory” which captures information about what has been computed so far. 


By design, an RNN cell takes two inputs at each time step:


• an input ($x_t$)
• and a hidden state ($h_{t-1}$) carried over from the previous time step


The left side of the above diagram shows the compact notation of an RNN and on
the right side we see the RNN being unfolded into a full network. By unfolding
we mean that, if the sequence we care about is a sentence of 5 words, the
network would be unfolded into a 5-layer neural network, one layer for each word.
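
Unfolding can be made concrete with a short loop. The sketch below assumes the common tanh update (new hidden state = tanh of a weighted input plus a weighted previous hidden state plus a bias), which the text does not spell out; the weights, the 2-dimensional "word" inputs and the name `rnn_step` are all invented for illustration.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    # New hidden state: tanh(W_xh x + W_hh h_prev + b), computed row by row.
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[r], x))
            + sum(w * hi for w, hi in zip(W_hh[r], h_prev))
            + b[r]
        )
        for r in range(len(b))
    ]

# Arbitrary illustrative weights for 2-dimensional inputs and hidden states.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.3]]
b = [0.0, 0.1]

# A "sentence" of 5 words, each already represented as a 2-dimensional vector.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.0, 0.0]]

h = [0.0, 0.0]       # initial hidden state
states = []
for x in inputs:     # the same weights are reused at every time step
    h = rnn_step(x, h, W_xh, W_hh, b)
    states.append(h)
print(states)
```

Each pass through the loop corresponds to one layer of the unfolded network, and the hidden state is the only channel through which earlier words can influence later steps.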


Seq2Seq Model with RNN 


A sequence-to-sequence model is a model that takes a sequence of items 
(words, letters, audio etc.) and outputs another sequence of items. 
Consider neural machine translation, where a sequence is a series of
words, processed one after another. The output will also be a series of words
(for example, when translating an English sentence into a German sentence).


So for this we can use RNNs in an encoder-decoder architecture, where the 
encoder and decoder tend to both be recurrent neural networks. 


[Figure: sequence-to-sequence model]


The encoder is supposed to generate a context vector, which will be fed to the
decoder. We can set the size of the context vector.


[Figure: RNN encoder-decoder. The encoder reads “<start> He wants to come now <end>”; the decoder, starting from “<start>”, emits “Er will ...” one word at a time.]
Problems with RNN


• These RNN-based architectures have a great limitation when working
with long sequences. The context vector is responsible for encoding all the
information of the source sentence into a vector of a few hundred elements,
which is a challenging task.


• In the encoder, the hidden state at every step is associated with a certain
word in the input sentence, usually one of the most recent. Therefore, if
the decoder only accesses the last hidden state of the encoder, it will lose
relevant information about the earlier elements of the sequence.


[Figure: word alignment between “er will jetzt kommen” and “he wants to come now”]


• Consider the translation problem where we are converting “He wants to
come now” in English to “er will jetzt kommen” in German. In
German, what should be the output after ‘will’? Should it be ‘jetzt’ or
‘kommen’? It is not a word-for-word translation where the first word
of the input simply becomes the first word of the translated output. Similarly,
the output might have a different number of words than the input sentence.
A plain encoder-decoder does not know which part of the sequence it has
to pay attention to.


• An RNN-based encoder-decoder is slow, as it processes the input words one at
a time, sequentially.


• It is prone to vanishing and exploding gradients.
Conclusion


In conclusion, Recurrent Neural Networks (RNNs) have been a cornerstone 
in the development of sequential data processing models, providing a 
framework for understanding and leveraging temporal dependencies in 
data. Despite their success, RNNs suffer from limitations such as difficulty in 
capturing long-range dependencies and vanishing/exploding gradient 
problems. These limitations have led to the development of more advanced 
models like LSTMs and GRUs, which address some of the shortcomings of 
traditional RNNs. However, many problems remained unsolved. To deal with
those problems, a new concept was introduced, known as “The Attention
Mechanism”.


As the field continues to evolve, it is important to recognize the
contributions of RNNs and their role in laying the foundation for future
advancements in sequential data processing.
Attention Mechanism


Overview 


The attention mechanism was introduced by Bahdanau et al. (2014) to
address the bottleneck problem that arises with the use of RNN-based
encoder-decoder architectures. The attention mechanism offers several
advantages over traditional Recurrent Neural Network (RNN) based
models, making it a preferred choice for many modern natural language
processing (NLP) tasks. In this chapter we will explore the fundamentals of
the “Attention Mechanism”.


Motivation 


Consider the example, where we are translating “I am learning” in English to 
“Ich lerne” in German. 


The idea behind attention mechanism is for each output token we want to 
have a context vector that is computed from all the input tokens. Also, when 
generating any word the model can decide which parts need attention. Note 
that the model will not just mindlessly align the first word at the output with 
the first word from the input. It actually learns from the training phase how 
to align words in any language pair (English and German in our example). 


In the image here, while generating the output word ‘Er’ the decoder
looks at the context vector $c_1$, which is computed from all of the input words.
Similarly, for ‘will’ the decoder will look at the context vector $c_2$, and the
same holds for each output word.
[Figure: encoder-decoder with attention. The encoder reads “<start> He wants to come now <end>”; while emitting “Er” and “will”, the decoder uses the context vectors $c_1$ and $c_2$.]


In this chapter we will see how we compute such context vectors. 


Data-Fitting Problem 


First, let’s consider a data-fitting scenario. We are given training data of n
instances comprising features and their corresponding target values


$$ \{(k_1, v_1), (k_2, v_2), \dots, (k_n, v_n)\} $$


Also, we are given a new query instance q, for which we want to predict the
target value v.


A naive approach will predict the simple average of the target values of all
training instances


$$ v = \frac{1}{n} \sum_{i=1}^{n} v_i $$


But here we are not considering which $(k_i, v_i)$ should have more influence
while predicting v.
[Figure: keys $k_1, k_2, k_3, k_4$ and a query q]


A better approach will be to take a weighted average, where the weights
correspond to the relevance or similarity of each key instance to the query q.


$$ v = \frac{1}{n} \sum_{i=1}^{n} \alpha(q, k_i) \, v_i $$


Here the weighting function $\alpha(q, k_i)$ encodes the relevance of instance
$k_i$ for predicting the target for q. In the data-fitting image above, q lies
closer to some keys than to others (considering Euclidean distance for
simplicity). So to compute the value for q, we should pay the most attention to
the closest key and the least to the farthest one.


Now let’s first compute the similarity between q and each $k_i$ as $sim(q, k_i)$.
The $k_i$ that is more similar to q will have more influence on the target
value v. Hence, we can apply a softmax function to the similarity values and
then multiply the softmax outputs by the corresponding $v_i$ values to
determine v. This is shown in the next image.
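
The softmax-weighted prediction described above can be sketched for scalar keys and values. The similarity used here, the negative squared distance between query and key, is an assumption chosen to match the Euclidean-distance intuition; the data and the name `attention_predict` are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_predict(q, keys, values):
    # Similarity: negative squared distance, so closer keys score higher.
    sims = [-(q - k) ** 2 for k in keys]
    weights = softmax(sims)
    prediction = sum(w * v for w, v in zip(weights, values))
    return prediction, weights

# Made-up training instances (key, value) and a query.
keys = [1.0, 2.0, 3.0, 4.0]
values = [10.0, 20.0, 30.0, 40.0]

prediction, weights = attention_predict(2.1, keys, values)
naive = sum(values) / len(values)   # simple average, ignores the query
print(prediction, weights, naive)
```

The softmax turns the similarities into weights that sum to one, so the prediction is a weighted average dominated by the keys nearest to the query.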
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softmax 


sim(q, k;) sim(q, k>) sim(q, k3) 


Lr | 
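The weighted-average idea above can be sketched in a few lines of plain
Python. This is only an illustrative sketch: the toy keys and values are made
up, and the similarity function (negative squared Euclidean distance) is an
assumption, since any reasonable similarity measure could be plugged in.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(query, keys, values):
    """Softmax-weighted average of the values; closer keys weigh more."""
    # Assumed similarity: negative squared Euclidean distance.
    sims = [-sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    weights = softmax(sims)
    return sum(w * v for w, v in zip(weights, values)), weights

# Hypothetical toy data: four 2-D keys with scalar target values.
keys = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
values = [1.0, 2.0, 8.0, 10.0]

prediction, weights = predict((1.2, 0.9), keys, values)
print(weights)      # the largest weight belongs to the key (1.0, 1.0)
print(prediction)   # so the prediction lies close to that key's value, 2.0
```

The query sits right next to the second key, so almost all of the weight goes
to its value, while the far-away keys contribute next to nothing.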


Data-Fitting Idea in Attention Mechanism


Let’s see how we can apply this above idea in our sequence-to-sequence 
problem. 


- Consider the input sequence of tokens $X = \{x_1, x_2, \ldots, x_T\}$, where
$T$ is the length of the input sequence.

- The encoder encodes it into fixed-length vectors
$\{h_1, h_2, \ldots, h_T\}$.

- The decoder is responsible for generating the output
$\{y_1, y_2, \ldots, y_{T'}\}$ token by token, where $T'$ is the length of the
output sequence.

- The decoder hidden states are $\{q_1, q_2, \ldots, q_{T'}\}$.

- Our aim here is to allow the decoder to access the entire encoded input
sequence $\{h_1, h_2, \ldots, h_T\}$ while generating the context vector.

The query here is $q_{t-1}$, the hidden state of the decoder just before it
produces the current decoder state $q_t$ and output $y_t$. The keys are all
the encoder hidden states $\{h_1, h_2, \ldots, h_T\}$.

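For each decoding position $t$, we want the context vector $c_t$ to be a
weighted sum of all the hidden states of the encoder, weighted by their
corresponding attention weights, i.e.

$$c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i$$


[Figure: encoder-decoder with attention. The encoder reads the input
X = ($x_1$, $x_2$, $x_3$, $x_4$) = “I bought few books”; the decoder takes the
previous output $y_{t-1}$ and emits the predicted word.]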


Here $\alpha$ is the attention matrix, where $\alpha_{t,i}$ for any $h_i$ and
$y_t$ determines how relevant $h_i$ is for the output $y_t$. These attention
weights are generated over the input sequence to prioritize the set of
positions where relevant information is present for generating the next
output token.


So first we need to generate these weights $\alpha_{t,i}$, which are called
the attention weights. But how do we get these?


The attention weights are produced by an additional attention block (a feed
forward neural network) within the architecture, whose parameters are learned
automatically. It captures the relevance between $h_i$ (the encoder hidden
state) and $q_{t-1}$ (the decoder hidden state from the previous time step).
There are two steps in this process.


Alignment Function 


First we compute the alignment score of the target word $y_t$ with the input
$x_i$. It is a function of two states:


• encoder hidden state, $h_i$
• decoder hidden state, $q_{t-1}$


This function is called the alignment function, as it scores how relevant the
encoder hidden state $h_i$ is for the decoder hidden state $q_{t-1}$. The
alignment function outputs energy scores $e_{t,i}$:

$$e_{t,i} = a(q_{t-1}, h_i)$$


Different alignment functions are possible for measuring the similarity
between two vectors, such as dot product, cosine similarity, scaled dot
product, etc. [Refer to the Extras section if you need a quick overview of
vector similarity functions.]


Probability Distribution


The outputs of the alignment function are then fed into a distribution
function (denoted by $p$), which converts the energy scores into attention
weights. The most commonly used distribution functions are the logistic
sigmoid and softmax. Both constrain each attention weight to [0, 1]; softmax
additionally makes the weights sum to 1, so that they can be interpreted as
probabilities that determine the relevance of each element. We use the softmax
function to generate the attention weights or probabilities:

$$\alpha_{t,i} = p(e_{t,i}) = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$
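The two steps (alignment, then distribution) and the resulting context vector
can be put together as follows. This is a minimal sketch in plain Python,
assuming a plain dot product as the alignment function; the book's attention
block is a learned network, and the state vectors below are made-up numbers.

```python
import math

def softmax(scores):
    """Distribution function p: energies -> weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(q_prev, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # Step 1, alignment (assumed here to be a plain dot product).
    energies = [sum(a * b for a, b in zip(q_prev, h)) for h in encoder_states]
    # Step 2, distribution: softmax over the energy scores.
    alphas = softmax(energies)
    # Context vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    c = [sum(alpha * h[d] for alpha, h in zip(alphas, encoder_states))
         for d in range(dim)]
    return alphas, c

# Hypothetical encoder states h_1..h_3 and previous decoder state q_{t-1}.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_prev = [2.0, 0.0]

alphas, c_t = context_vector(q_prev, H)
print(alphas)   # h_1 and h_3 align equally well with q_prev; h_2 gets less
print(c_t)
```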


Attention Block Training 


The attention weights $\alpha$ themselves are not stored parameters; what we
train is the attention block that produces them. During training, the
parameters of the attention block are learnt from the previous decoder states
and all the encoder states, jointly with the encoder and decoder parameters.
Context vectors are generated accordingly.


After training is over, the parameters of the attention block are fixed. When
testing with a new sequence, the same attention block computes fresh attention
weights from that sequence's encoder and decoder states, and these weights
generate the context vector for each decoder output. If the training went
well, the context vectors will be informative, and hence each predicted next
word will be more accurate.


Our sequences can be of any variable length $T$, but in practice the sequences
in a batch are padded (or truncated) to a fixed maximum length, so that the
attention matrix has a fixed shape.


Advantages of Attention Mechanism


Here are the key reasons why attention mechanisms are beneficial: 


Parallel Processing with Attention: 
The attention mechanism can apply parallel processing on all input 
tokens, significantly speeding up training and inference times. 


Handling Long-Range Dependencies: 
While generating output, the attention mechanism captures context
from the whole input and can directly relate any two words in the
input sequence, regardless of their distance.


Improved Focus on Relevant Parts of Input: 
The attention mechanism dynamically weights the importance of 
different parts of the input sequence while generating the output 
sequence, which allows the model to focus on the most relevant 
information for each output step. 


Enhanced Performance in NLP Tasks: 
Attention mechanisms, particularly when combined into 
architectures like the Transformer, have shown superior 
performance across various NLP tasks, including machine 
translation, text summarization, and sentiment analysis. 


Scalability and Flexibility: 

Attention mechanisms, particularly in transformer models, scale better 
with increased data and model size, making them more suitable for modern 
large-scale NLP applications. 


Conclusion 


In summary, the attention mechanism has revolutionized the way we 
approach sequence processing tasks in artificial intelligence. By allowing 
models to focus selectively on different parts of the input sequence,
attention mechanisms enhance the model's ability to capture relevant
contextual information, thereby improving performance across a wide range 
of applications. Unlike traditional approaches that treat all parts of the input 
equally or rely on fixed-size windows, attention mechanisms provide a 
flexible and dynamic way to weigh the importance of various elements based 
on the context. This adaptability not only boosts the accuracy of predictions 
but also opens up new possibilities for interpreting and understanding model 
behavior. As we continue to refine and extend attention-based methods, 
their impact on fields such as natural language processing, machine 
translation, and beyond will undoubtedly grow, solidifying their status as a 
cornerstone of modern AI technology.


Self-Attention


Overview 


This chapter delves into the mystery behind self-attention, highlighting how 
this mechanism enhances model performance, efficiency, and 
interpretability. By understanding these advantages, readers will gain a 
deeper appreciation of why self-attention is a foundational component of 
modern AI architectures like transformers. 


Motivation 


The motivation for self-attention arises from the need to capture 
dependencies between all elements in a sequence, regardless of their 
distance from each other. Unlike recurrent neural networks (RNNs), which 
process sequences in a step-by-step manner and struggle with long-range 
dependencies, self-attention mechanisms allow for direct connections
between any pair of elements in a sequence. This capability enables models
to efficiently understand context over long distances, leading to more 
accurate and coherent representations. 


Moreover, self-attention facilitates parallel processing, as it does not rely on 
sequential computations. This parallelism significantly accelerates training 
and inference times, making it feasible to handle large datasets and complex 
tasks. By addressing these critical challenges, self-attention has become a 
fundamental component in modern transformer architectures, driving 
advancements across various AI applications. 


So, in self-attention we look only at the input sentence, without thinking
about the output. Before any decoding happens, we need to see, at the level of
the input sentence, whether a pair of words is relevant to each other or not.
Let’s look at two sentences here:


- “The animal did not cross the road, because it was too wide” 
- “The animal did not cross the road, because it was too tired” 


In the first sentence the adjective ‘wide’ refers to the road, and in the
second sentence the word ‘tired’ refers to the animal. Likewise, the word ‘it’
refers to the road and the animal respectively. So we need to find out which
word pairs are more relevant to each other.


[Figure: the sentence “The animal did not cross the road because it was too
wide”, with high-attention and low-attention links between word pairs.]


[Figure: the sentence “The animal did not cross the road because it was too
tired”, with high-attention and low-attention links between word pairs.]


So basically, when the model processes each word (each position in the input
sequence), self-attention lets it interact with the words at the other
positions to find out which ones it should pay more attention to. This helps
to build a better encoding for that word. When the model processes the word
“it”, self-attention allows it to associate “it” with either “animal” or
“road”.


How Does Self-Attention Work?


Let the word embeddings of the input sequence be $\{x_1, x_2, \ldots, x_n\}$.
Self-attention can be seen as the directed, weighted, fully connected graph
shown below, where the vertices are the inputs and each edge with weight
$\alpha_{ij}$ determines how much importance input $x_j$ has for $x_i$.


[Figure: Self-attention as a directed, weighted, fully connected graph (not
all edges are shown).]


Now, the number of outputs is the same as the number of inputs here, as we are
dealing with a single input sequence. Let’s look at the i-th output $y_i$,
which should be a weighted combination of all inputs:

$$y_i = \sum_j \alpha_{ij}\, x_j$$


[Figure: self-attention from the input sequence to the outputs; thick lines
denote higher attention.]


In matrix form we can write $Y = AX$, where $A$ is the attention matrix. But
how do we generate the attention matrix? What are the query, key and values
here?


Each word $x_i$ in the input sequence becomes a query. For each $x_i$, all the
$x_j$s contribute keys and values, where $j \in \{1, 2, \ldots, n\}$.


So basically each word in the input should generate a query, a key and a value
vector.


[Figure: each of the 12 words of “The animal did not cross the road because it
was too wide.” with its own query, key and value vectors, from
($q_1$, $k_1$, $v_1$) to ($q_{12}$, $k_{12}$, $v_{12}$).]


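As we saw in the previous chapter, $\alpha_{ij}$ can be written as
$p(a(x_i, x_j))$, where $p$ is the probability distribution (softmax) and $a$
is the alignment function. Using the dot product as the alignment function,

$$\alpha_{ij} = p(a(x_i, x_j)) = \frac{\exp(x_i^\top x_j)}{\sum_k \exp(x_i^\top x_k)}$$


As $y_i = \sum_j \alpha_{ij}\, x_j$, we can write it as shown below, with the
roles of query, key and value marked:

$$y_i = \sum_j p\big(a(\underbrace{x_i}_{\text{query}}, \underbrace{x_j}_{\text{key}})\big)\, \underbrace{x_j}_{\text{value}}$$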
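The formula above, where the raw embeddings play all three roles, can be
sketched directly in plain Python. The embeddings below are made-up 2-D
vectors used purely for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def naive_self_attention(X):
    """y_i = sum_j softmax_j(x_i . x_j) * x_j, using the raw embeddings."""
    outputs = []
    for x_i in X:                                       # x_i acts as the query
        scores = [sum(a * b for a, b in zip(x_i, x_j))  # x_j acts as the key
                  for x_j in X]
        alphas = softmax(scores)
        y_i = [sum(alpha * x_j[d] for alpha, x_j in zip(alphas, X))
               for d in range(len(x_i))]                # x_j acts as the value
        outputs.append(y_i)
    return outputs

# Hypothetical embeddings of a three-word input sequence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = naive_self_attention(X)
for y in Y:
    print(y)   # one output vector per input word
```

Note that nothing in this version is learned: the attention pattern is fully
determined by the embeddings themselves.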


Now, should we use the word embeddings $x_i$, $x_j$ directly as the query, key
or value vectors? Or can we use some other functions to transform them into
different spaces?


Consider the two simple sentences here.


- An apple and an orange a day keeps you healthy.

- Apple launched their new iPhone.

In the first sentence apple means a fruit, and in the second one it refers to
a company.

Consider the embedding of the words in the 2D space shown. Orange and iPhone
are quite distant from each other. But we don’t know where to put ‘apple’
without knowing the context.


But given the context, the embeddings will improve and “apple” will move
towards either orange or iPhone.


Now our aim is to have a space where the relevance and irrelevance of two
words are captured well. Consider the three linear transformations below, and
think about which transformed space serves our purpose better.


[Figure: Linear transformations of the original embedding.]


Clearly, the first one does okay, but the second one is not able to
distinguish between ‘apple’ as a fruit and ‘apple’ as a company. The third
one, however, does a great job, as shown in the picture.


So we got some intuition about why transforming the $x_i$s and projecting them
into a different space is a good thing to do. But how do we do the
transformation? Let’s say we use $\phi_q$, $\phi_k$, $\phi_v$ as the
transforming functions:

$$y_i = \sum_j p\big(a(\underbrace{\phi_q(x_i)}_{\text{query}}, \underbrace{\phi_k(x_j)}_{\text{key}})\big)\, \underbrace{\phi_v(x_j)}_{\text{value}}$$


The simplest way to transform a vector is to multiply it by a transformation
matrix. Let’s take three matrices $W_Q$, $W_K$ and $W_V$ for the
transformation; these matrix transformations play the role of the $\phi_q$,
$\phi_k$, $\phi_v$ functions we just discussed. We will be training these
matrices in the self-attention block of the transformer architecture.


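Query: $q_i = W_Q x_i$, $q_i \in \mathbb{R}^{d_k}$, $W_Q \in \mathbb{R}^{d_k \times d}$ [for $i \in \{1, 2, \ldots, T\}$]
Key: $k_i = W_K x_i$, $k_i \in \mathbb{R}^{d_k}$, $W_K \in \mathbb{R}^{d_k \times d}$ [for $i \in \{1, 2, \ldots, T\}$]
Value: $v_i = W_V x_i$, $v_i \in \mathbb{R}^{d_v}$, $W_V \in \mathbb{R}^{d_v \times d}$ [for $i \in \{1, 2, \ldots, T\}$]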


The reason for taking the query and key to the same dimension is that we want
to take their dot product (assuming we will be using dot-product attention).


Let’s say we pack our word embeddings (the $x_i$s) into a matrix $X$, one
embedding per row. Multiplying $X$ by the weight matrices ($W_Q$, $W_K$,
$W_V$), we get our query, key and value matrices ($Q$, $K$, $V$).


These matrices $Q$, $K$ and $V$ now hold all the $q$, $k$ and $v$ vectors
respectively.


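$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \quad X \in \mathbb{R}^{T \times d}, \quad T = \text{input length}$$

$$Q \in \mathbb{R}^{T \times d_k}, \quad K \in \mathbb{R}^{T \times d_k}, \quad V \in \mathbb{R}^{T \times d_v}$$

$$Q = \{q_1, q_2, \ldots, q_T\}, \quad K = \{k_1, k_2, \ldots, k_T\}, \quad V = \{v_1, v_2, \ldots, v_T\}$$

(Here each row of $X$ is one embedding, so the weight matrices act from the
right and have shapes $d \times d_k$ and $d \times d_v$.)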
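To make the shapes concrete, here is a small plain-Python sketch of the three
projections. The sizes (T = 3, d = 4, d_k = 2, d_v = 3) and all the numbers
are made up; in a real model the weight matrices would be learned, and a
tensor library would do the multiplications.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Hypothetical embeddings: T = 3 tokens, embedding size d = 4 (one per row).
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]

# Hypothetical weights (fixed numbers here, learned in practice).
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]        # d x d_k
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]        # d x d_k
W_V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]                      # d x d_v

Q = matmul(X, W_Q)   # T x d_k, row i is the query q_i
K = matmul(X, W_K)   # T x d_k, row i is the key k_i
V = matmul(X, W_V)   # T x d_v, row i is the value v_i

print(Q)   # [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
```

Queries and keys share the size d_k so that their dot product is defined,
while the values are free to use a different size d_v.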


Let’s look at a visual representation of these on the next page.


The remaining procedure we will discuss in detail in the context of scaled
dot-product attention, as it is used in the Transformer architecture.


Scaled Dot-Product Attention


We will be discussing a special type of attention called “Scaled Dot-Product
Attention”, which is used in the Transformer architecture. Dot-product
attention is very fast and space-efficient in practice, since it can be
implemented using highly optimized matrix multiplication code. The steps are
shown in the image below.


Scaled Dot-Product Attention 


MatMul: 


This first step matches the queries against the keys. In this step we compute
the dot products $QK^\top$, which produces a score or compatibility matrix.
The score matrix establishes the degree of emphasis each word should give to
the other words. Each word is therefore assigned a score in relation to every
other word in the sequence. A higher score indicates that more attention is
needed.


[Figure: the score matrix $QK^\top$.]


Scale: 


Scale the dot products by dividing them by $\sqrt{d_k}$, where $d_k$ is the
dimension of each query or key. This step is needed because when $d_k$ becomes
large, the variance of the dot product $q_i^\top k_j$ becomes large, which
pushes the softmax into regions with very small gradients (the vanishing
gradient problem). For small values of $d_k$ the scaling makes little
difference, but for larger values of $d_k$ unscaled dot-product attention
performs worse, and the scaling counteracts this.
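The effect of the scaling on the variance can be checked with a small
experiment. This sketch assumes that the entries of the queries and keys are
independent with zero mean and unit variance, which is an idealisation; the
dimension 64 and the number of trials are arbitrary choices.

```python
import math
import random

random.seed(0)   # fixed seed so that the experiment is repeatable

def dot_variance(d_k, trials=2000, scale=False):
    """Empirical variance of q . k for random q, k with unit-variance entries."""
    products = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        products.append(s / math.sqrt(d_k) if scale else s)
    mean = sum(products) / trials
    return sum((p - mean) ** 2 for p in products) / trials

raw_var = dot_variance(64)                  # roughly d_k, i.e. about 64
scaled_var = dot_variance(64, scale=True)   # roughly 1 after the scaling
print(raw_var, scaled_var)
```

Without scaling the variance grows in step with d_k, so the scores fed into
the softmax spread further apart as the vectors get longer; dividing by the
square root of d_k keeps the variance near 1 regardless of the dimension.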


[Figure: Scaled score matrix.]


Mask (Optional)


This is an optional step, required only in a few cases. We will discuss it in
detail when we discuss the Transformer decoder.

Let’s say we are generating a new sequence: a few words have already been
generated and a few are yet to be generated. So while doing self-attention we
use only the past word positions and mask the future word positions.


[Figure: scaled scores + mask = masked scores.]


We do that by setting the upper triangular section of the matrix (above the
diagonal) to $-\infty$, so that when we apply softmax in the next step, the
upper triangular values become zero.
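The masking trick can be sketched as follows in plain Python. The score matrix
is made up; the point is only that entries set to minus infinity come out of
the softmax as exactly zero.

```python
import math

def causal_mask(scores):
    """Set the entries above the diagonal (future positions) to -infinity."""
    n = len(scores)
    return [[scores[i][j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]   # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scaled scores for a three-word sequence.
scores = [[0.5, 1.0, 2.0],
          [1.5, 0.2, 0.3],
          [0.1, 0.4, 0.9]]

weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print(row)   # row i has zeros at every position j > i
```

The first word can only attend to itself, so its row becomes
[1.0, 0.0, 0.0]; the last word can still attend to all three positions.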


Softmax: 


Subsequently, a softmax function is applied to normalise the adjusted scores
and obtain the attention weights. This results in probability values ranging
from 0 to 1.

The softmax function emphasizes higher scores while diminishing lower
scores, thereby enhancing the model's ability to effectively determine which
words should receive more attention.


[Figure: Higher scores are emphasized and lower scores are depressed.]


Combine Softmax Results with the Value Vector


In the next step we multiply the weights derived from the softmax function by
the value vectors, resulting in an output vector for each word. In this
process, the values of the words that have high softmax scores are preserved.


[Figure: softmax scores × values = output.]


Finally we get the output of the self-attention layer. Combining all the
steps, we can write it as shown below:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
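All the steps (MatMul, Scale, optional Mask, Softmax, and the final product
with the values) fit into one short function. The following is a plain-Python
sketch for readability, not an optimised implementation; the `causal` flag
and the tiny Q, K, V matrices are illustrative assumptions.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                  # MatMul
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]    # Scale
    if causal:                                                        # Mask
        scores = [[s if j <= i else -math.inf for j, s in enumerate(row)]
                  for i, row in enumerate(scores)]
    weights = [softmax(row) for row in scores]                        # Softmax
    return matmul(weights, V), weights                                # x Values

# Hypothetical query, key and value matrices for a two-word sequence.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)   # each word attends mostly to itself here
print(output)
```

Calling the function with `causal=True` makes the first output row exactly
the first value vector, since the first word is then not allowed to look at
the second one.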


Limitations of Self-Attention 


While self-attention is a powerful mechanism that has revolutionized
natural language processing and various AI tasks, it is not without its
limitations. This section looks at the constraints and challenges
associated with self-attention, providing a balanced perspective on its
strengths and weaknesses. By understanding these limitations, readers can
better appreciate the innovations and design choices that have been made to
address them, such as multihead attention and other enhancements.


Computational Complexity 

Self-attention mechanisms require calculating attention scores between all 
pairs of input tokens, leading to a quadratic complexity with respect to the 
sequence length. This makes self-attention computationally expensive and 
memory-intensive, especially for long sequences. 


Lack of Explicit Positional Information 

Unlike RNNs, which inherently process data sequentially and maintain 
order, self-attention mechanisms do not natively capture the order of 
tokens. 


Difficulty in Handling Very Long Sequences 

As the sequence length increases, the memory requirements for storing 
attention matrices grow significantly, making it challenging to handle very 
long sequences efficiently. 


Conclusion 


In conclusion, self-attention has proven to be a transformative innovation in 
the field of neural networks, especially for tasks involving sequential data. 
By enabling models to dynamically weigh the importance of different
elements within a sequence, self-attention mechanisms capture intricate
dependencies and contextual relationships that traditional models often
miss.


Self-attention's parallel processing capability further enhances its efficiency, 
making it well-suited for modern computational environments and large 
datasets. It allows models to be trained faster and more effectively, leading 
to better performance in a wide range of applications, from machine
translation to text generation and beyond.


Positional Encoding


Overview


In this chapter, we will explore the concept of positional encoding and its 
critical role in transformer architecture. Unlike traditional sequential 
models like RNNs, which inherently process data in order, transformer 
models rely on self-attention mechanisms that do not have a built-in sense of 
sequence. Positional encoding addresses this limitation by providing the 
model with information about the position of tokens within a sequence, 
allowing it to capture the order and structure necessary for effective 
processing. 


Motivation 


RNNs have a built-in notion of time, because they process the input words one
at a time. In self-attention, however, we do not consider time steps. For each
input word we create a context vector from all the input words, and there is
no notion of ordering among the words.


Consider the input sentence “I loved the food but not the ambience”. What
naive self-attention sees is just a bag of words.


[Figure: the words of the sentence (“I”, “loved”, “the”, “food”, “but”, “not”,
“ambience”) scattered as an unordered bag of words.]


So we can take any permutation of the words, and each word will still get the
same attention weights with respect to the other words.
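This order-blindness is easy to check numerically. The sketch below uses naive
self-attention on the raw embeddings (dot-product alignment followed by
softmax); the three 2-D embeddings and the chosen permutation are made up for
illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Naive self-attention without any positional information."""
    Y = []
    for x_i in X:
        alphas = softmax([sum(a * b for a, b in zip(x_i, x_j)) for x_j in X])
        Y.append([sum(w * x_j[d] for w, x_j in zip(alphas, X))
                  for d in range(len(x_i))])
    return Y

# Hypothetical 2-D embeddings for a three-word "sentence".
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]                         # a reordering of the words
X_shuffled = [X[p] for p in perm]

Y = self_attention(X)
Y_shuffled = self_attention(X_shuffled)

# Each word gets the same output vector no matter where it is placed.
same = all(abs(a - b) < 1e-9
           for i, p in enumerate(perm)
           for a, b in zip(Y_shuffled[i], Y[p]))
print(same)   # True
```

Shuffling the input only shuffles the outputs in the same way, so nothing in
the result tells the model which word came first.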


“I loved the ambience but not the food” (different meaning) 
“I ambience but the loved the not food ” (no meaning) 


This behaviour is different from RNN architectures: an RNN processes one word
at a time, so the model tries to remember which words it saw before, and can
therefore remember their order.


Also consider one more example:


“Cinnamon comes from the bark of the Cinnamon tree”


Here the first and the second “Cinnamon” have different meanings. The first
one refers to the spice and the second one refers to the Cinnamon tree.
Should both occurrences of “Cinnamon” have the same encoding?


How do we handle this? If we apply attention as it is, the query, key and
value for both of them will be the same, and so will the attention weights.


As we are not processing tokens sequentially over time, can we at least add 
the time or position information to the word embedding itself? 


The solution here is to add a position vector of the same size as the word
embedding, and then compute the query, key and values. These positional
vectors PE(t) depend on the time step t, but the question is: how do we get
those positional encoding vectors?


Consider one more example sentence:
“The King supported the plan but the Queen didn’t”


Consider the words “King” and “Queen” in a multidimensional space. “King”
and “Queen” will be close to each other because of their semantic similarity.
But here “King” comes in the second position, whereas “Queen” comes in the
eighth position. So we add the positional encoding to the vectors to move them
a little bit, because we want to capture the word ordering. In the next
picture we see King and Queen shifted a bit along some other dimensions after
adding the positional embedding or encoding, producing the vectors King’ and
Queen’.


[Figure: the vectors King and Queen and their shifted versions King’ and
Queen’; Queen is shifted by the encoding of the eighth position.]


After adding the positional encoding, “King” will move a little bit to cluster
with the second words of the other sentences in the corpus. Similarly, “Queen”
will tend to cluster with the eighth words of other sentences.


Properties required for the Positional Encodings


• Every unique position should have a unique vector (it should be
unambiguous).


• Every position should have the same positional vector irrespective of the
sequence length and the input at that position. So, the sequence might
change but the positional embedding will stay the same.


• As the word embedding vectors are shifted a little bit, we don’t want the
shift to be too large; otherwise it pushes the vectors into very distinct
subspaces where the positional similarity or dissimilarity overshadows the
semantic similarity.


• It should work with sequences longer than those encountered during
training. If we trained the model with a maximum sequence length of 512, the
function should still work for a sequence of length 513 or more.


• It should be of the same size as the word embedding. This requirement arises
because in the transformer architecture we add the word vectors to the
positional encoding vectors.


• We should be able to estimate the distance between tokens.


[Figure: King and Queen shifted to King’ and Queen’ by their positional
encodings.]


How to get these positional encodings? 


The main challenge now is how to get the positional encodings. For any
position $pos$, the word embedding at that position is $x_{pos}$. We want to
generate PE(pos), the positional encoding for position $pos$, that will be
added to $x_{pos}$.


Can we use a linear non-periodic function?


Let’s consider a linear non-periodic function to generate the positional
encodings. Here, no matter whether we decide to add or concatenate the
positional encoding vector to our word embedding vector, the shifts in one
dimension or another will grow a lot for longer sequence lengths. So we are
violating the requirement that the shifts should be small, in other words
bounded.


[Figure: a linear positional function; the positional encodings PE(1), PE(2),
..., PE(100) are added to the word embeddings $x_1$, $x_2$, ..., $x_{100}$.]


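Can we normalise the values to fit in a range of, let’s say, 0 and 1?


Can we use a bounded function like sigmoid?


The values of the sigmoid function are bounded between 0 and 1. But we need
some amount of variability between consecutive positions, and the sigmoid
saturates: for large positions its values are all very close to 1, so
consecutive positions become nearly indistinguishable.


It does not help there.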


Can we use any periodic trigonometric function?


Yes, we can use sines and cosines, as they are periodic and keep a lot of
variability even for big numbers. As their values are bounded by 1 and -1,
even for an enormous sequence length we will get values between -1 and 1 as
the elements of our positional encoding.


[Figure: a sine wave plotted for x from 0 to 10.]


[Refer to the Extras section if you need a quick overview of periodic
functions.]


Now let’s use a one-dimensional sine function as our PE(pos). The problem here
is that the values are going to repeat after some point.


[Figure: sin(x) for d=0; the vertical axis shows the encoding feature value.]


Can we use a sine function with a very low frequency, such that its values
don’t repeat frequently?

Well, in that case we get almost the same situation as with the linear
function.


[Figure: sin(x/10) for d=0.]


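Now let’s use a two-dimensional vector for each position, where for the first
dimension we use a sine function and for the second dimension a cosine
function.


[Figure: sin(x) and cos(x) plotted for x from 0 to 10.]


$$PE(pos) = \begin{bmatrix} \sin(pos) \\ \cos(pos) \end{bmatrix}, \quad PE(0) = \begin{bmatrix} \sin(0) \\ \cos(0) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad PE(1) = \begin{bmatrix} \sin(1) \\ \cos(1) \end{bmatrix} \approx \begin{bmatrix} 0.84 \\ 0.54 \end{bmatrix}$$


These two-dimensional position vectors will not repeat soon, but will repeat
eventually.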


How about we add more dimensions of sines and cosines? Can we add
dimensions with varying frequencies of sines and cosines?


As the frequency decreases, the distance between the peaks of the sine wave
increases. So if we add dimensions with sines and cosines of lower frequencies
for the higher dimensions, then the values in those dimensions will not change
frequently.


[Figure: sin(x/2), sin(x) and sin(2x) plotted together.]


Also, if we keep alternating between sines and cosines with decreasing
frequency, we can give enough information to ensure that the model cannot
miss the order of the sequence.


[Figure: sin(x), cos(x), sin(0.5x), cos(0.5x), sin(0.3x) and cos(0.3x) plotted
together.]


So if we use only sines and cosines, but with a gradually decreasing frequency
for higher dimensions, then the values in the higher dimensions won’t change
frequently, and we will get a unique encoding vector for each position.


Consider the binary encoding of numbers. The numbers 0 to 15 are represented
in a 4-digit binary format. The first column represents zero with all 0s
(0000), the second column is one, represented by (0001), and so on.


[Figure: left, the numbers 0, 1, 2, 3, ... with their binary codes (e.g.
1 = 0001, 3 = 0011); right, the codes of 0 to 15 drawn as a grid of blocks,
with one row each for the 0th, 1st, 2nd and 3rd bit.]


In the right image above, white blocks are for 0s, and black for 1s.


Here the higher bits of the binary encoding do not change often as the numbers
increase, whereas the lower bits change frequently. Yet this type of encoding
uniquely identifies each number. We are trying to achieve a somewhat similar
encoding for our positional vectors.


So, let’s say we use a positional encoding vector like the one below, with
dimension $dim$, the same as the dimension of the word embedding vector:

$$PE(pos) = \begin{bmatrix} \sin(freq_1) \\ \cos(freq_1) \\ \sin(freq_2) \\ \cos(freq_2) \\ \vdots \\ \sin(freq_{dim/2}) \\ \cos(freq_{dim/2}) \end{bmatrix} \quad \text{where } freq_k \text{ decreases with increasing } k$$


The frequency term $freq_k$ also needs to be a function of the position $pos$.
In the “Attention Is All You Need” paper, the authors came up with the formula
below:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/dim}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/dim}}\right)$$


For even dimensions we use the sine formula, and for odd dimensions we use the
cosine formula. Equivalently, we can write it as shown below:

$$PE(pos) = \begin{bmatrix} \sin(pos/\lambda_0) \\ \cos(pos/\lambda_0) \\ \sin(pos/\lambda_1) \\ \cos(pos/\lambda_1) \\ \vdots \\ \sin(pos/\lambda_{dim/2-1}) \\ \cos(pos/\lambda_{dim/2-1}) \end{bmatrix} \quad \text{where } \lambda_i = 10000^{2i/dim}$$


For $\lambda_i$, $i$ always increases with the dimension $d$, hence
$10000^{2i/dim}$ also increases, and $\frac{pos}{10000^{2i/dim}}$ decreases
for higher dimensions.
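The formula translates almost line by line into code. Below is a minimal
plain-Python sketch that assumes an even `dim`; the function name and the
example size of 8 dimensions are illustrative choices.

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal positional encoding of one position (dim assumed even)."""
    pe = []
    for i in range(dim // 2):
        wavelength = 10000 ** (2 * i / dim)     # grows with the dimension
        pe.append(math.sin(pos / wavelength))   # even dimension 2i
        pe.append(math.cos(pos / wavelength))   # odd dimension 2i + 1
    return pe

pe0 = positional_encoding(0, 8)   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
pe1 = positional_encoding(1, 8)
print(pe0)
print(pe1)   # the first pair changes a lot, the last pair hardly at all
```

Every entry stays between -1 and 1 whatever the position, and the later
dimensions change more and more slowly, much like the higher bits of a binary
counter.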


We can visualise the whole idea to understand it better. Below is an image of
the positional encoding values for the first 20 positions, generated with the
TensorFlow positional encoding code. We see that for higher dimensions the
values don’t change easily.


[Figure: heat map of positional encoding values (between -1 and 1) by position
and embedding dimension.]


Positional encoding vectors created this way satisfy our criteria.


• All vectors are unique, and every position has the same positional vector
irrespective of the sequence length and the input at that position.


• If we add the position vectors to our word embedding vectors, the shifts
won’t be too small or too large.


• As sines and cosines are periodic functions, our model will be able to
generalise to sequence lengths longer than those encountered during model
training.


• Here positional encodings are linear transformations of each other. We will
just quickly touch on this part.


Consider two positions $pos$ and $pos + t$, with
$\lambda_i = 10000^{2i/dim}$:

$$PE(pos) = \begin{bmatrix} \sin(pos/\lambda_0) \\ \cos(pos/\lambda_0) \\ \vdots \\ \sin(pos/\lambda_{dim/2-1}) \\ \cos(pos/\lambda_{dim/2-1}) \end{bmatrix}, \quad PE(pos + t) = \begin{bmatrix} \sin((pos+t)/\lambda_0) \\ \cos((pos+t)/\lambda_0) \\ \vdots \\ \sin((pos+t)/\lambda_{dim/2-1}) \\ \cos((pos+t)/\lambda_{dim/2-1}) \end{bmatrix}$$


We can apply the trigonometric angle-addition identities to each pair of
entries of $PE(pos + t)$ and see how it goes:

$$\sin\left(\frac{pos+t}{\lambda_i}\right) = \sin\left(\frac{pos}{\lambda_i}\right)\cos\left(\frac{t}{\lambda_i}\right) + \cos\left(\frac{pos}{\lambda_i}\right)\sin\left(\frac{t}{\lambda_i}\right)$$

$$\cos\left(\frac{pos+t}{\lambda_i}\right) = \cos\left(\frac{pos}{\lambda_i}\right)\cos\left(\frac{t}{\lambda_i}\right) - \sin\left(\frac{pos}{\lambda_i}\right)\sin\left(\frac{t}{\lambda_i}\right)$$


After a few more steps it can be shown that $PE(pos + t)$ can be represented
as $A \cdot PE(pos)$, where $A$ is a matrix that depends on the offset $t$ but
not on $pos$, and $\cdot$ denotes matrix multiplication. This means that
$PE(pos + t)$ is a linear transformation of $PE(pos)$.
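The claim can be verified numerically for a single (sine, cosine) pair. In the
sketch below the 2x2 matrix is read off from the angle-addition identities;
the wavelength 100 and the offset 7 are arbitrary illustrative values, and the
full matrix A would simply carry one such 2x2 block per wavelength on its
diagonal.

```python
import math

def pe_pair(pos, lam):
    """The (sine, cosine) pair of the encoding for one wavelength lam."""
    return [math.sin(pos / lam), math.cos(pos / lam)]

def shift_matrix(t, lam):
    """2x2 block with block * PE(pos) = PE(pos + t); it depends on t only."""
    c, s = math.cos(t / lam), math.sin(t / lam)
    return [[c, s],
            [-s, c]]

def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

lam, t = 100.0, 7            # hypothetical wavelength and offset
A = shift_matrix(t, lam)     # computed once, reused for every position

errors = []
for pos in (0, 3, 50):
    shifted = matvec(A, pe_pair(pos, lam))   # A applied to PE(pos)
    direct = pe_pair(pos + t, lam)           # PE(pos + t) computed directly
    errors.append(max(abs(a - b) for a, b in zip(shifted, direct)))

print(max(errors))   # only floating-point rounding error remains
```

The same matrix works for every position, which is what allows a model to
reason about relative offsets between tokens.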


• Also, if we look at different frequencies, we can estimate the distance
between two positions, which is helpful if the model tries to understand the
relative positions of the words.


[Figure: waves of different frequencies across the dimensions, from d=1 to
d=dim.]


Conclusion


Positional encoding is a foundational component that complements self- 
attention mechanisms, ensuring that transformer models can effectively 
process and understand sequential data. By encoding the positions of 
tokens, this technique enables the model to maintain the order and 
structure necessary for accurate context understanding and performance
across various tasks.


Multi-head Attention


Overview 


Multi-Head Attention is an extension of the attention mechanism, 
particularly the scaled dot-product attention, designed to enhance the 
model's ability to focus on different parts of the input sequence 
simultaneously. 


We already saw how we get the word embeddings and then apply self-attention on
top of them using our key, value and query vectors. In this chapter, we go one
step further and discuss multi-head attention and how it enhances the
performance of transformer models.


Multihead attention is a critical component of the transformer architecture, 
significantly contributing to its effectiveness and versatility. By breaking 
down the mechanics of multihead attention, we aim to provide a clear 
understanding of its role in processing complex data patterns and improving 
model accuracy. 


Motivation 


When we learnt about word embeddings, we saw that each dimension of the
embedding may denote some aspect of the input.


Consider the example we saw in a previous chapter. Here we assume
four-dimensional word embedding vectors for the words “King”, “Queen”,
“Man” and “Woman”, with four attributes (power, rich, technology, gender)
as the four dimensions.


King and Queen both have power and are rich, hence they have higher values
for those attributes. Nothing here is related to technology, hence the
technology attribute is 0. Gender is denoted as -1 for male and 1 for female.
These attributes are for our understanding only; a trained model learns its
own set of attributes.


Here each of the four dimensions describes one specific feature. Computers
do not work with human-readable features like these; they learn their own
numerical representation of aspects that we have no clue about :(.


             King    Queen   Man     Woman
Power        0.9     …       …       …
Rich         0.89    0.85    0.45    0.38
Technology   0       0       0       0
Gender       -1      1       -1      1


In self-attention we saw that we have one set of W_K, W_Q, W_V matrices to
generate the key (k), query (q) and value (v) vectors for a word vector.
These help to identify the relevance or similarity between two word vectors.
But let’s think more deeply. When we are trying to find the similarity
between King and Queen, “Technology” is an unnecessary feature. If we
consider Man and Woman, “Gender” alone is enough to distinguish them.


So what if we divide the word vector into a fixed number (n) of chunks, as
shown below? Each chunk is also called a head.


[Figure: word vectors, each divided into blue, yellow and red chunks (heads)]


We will apply self-attention on the corresponding chunks. Each chunk has its
own query, key and value vectors. In the example here, we apply
self-attention across the blue parts of the different word vectors, and do
the same for the yellow and red parts. Then we concatenate the results.


And why do we do that? To extract finer details. As we will be using
multiple transformation matrices and multiple vectors for each word, we
will be extracting more information. So let’s create multiple query, key and
value vectors for each word embedding x_i.


We have already seen the steps of Scaled Dot-Product Attention, now let’s 
discuss the steps of Multihead-Attention. 


Steps 


This mechanism allows the model to jointly attend to information from 
different representation subspaces at different positions. 

Here's a step-by-step explanation of Multi-Head 
Attention: 


Step 1


First we need to decide how many heads or chunks we will have. Let’s say the
number of heads is h. In the original Transformer architecture this value is
set to eight. So each word will have h query, key and value vectors.


[Figure: Q, K and V fed into Scaled Dot-Product Attention]


Step 2


Next we perform Scaled Dot-Product Attention for each head, with that head's
own query, key and value matrices.


[Figure: head 1 computed from Q^(1), K^(1), V^(1) and head 2 computed from Q^(2), K^(2), V^(2)]


Step 3


This leaves us with a bit of a challenge. We don’t want h matrices as output;
rather, we expect a single matrix (a vector for each word). So we need a way
to condense these h matrices down into a single matrix.


How do we do that? 


We have h head outputs $Z_1, Z_2, \dots, Z_h$, each of dimension
$T \times \frac{d}{h}$. We concatenate them and get the output $Z'$ of
dimension $T \times d$.


Step 4 


Then we multiply $Z'$ with a weight matrix $W^O$, which is also learned
during our training phase. And finally we get our output as


$$\text{MultiHead}(Q, K, V) = Z = Z' \times W^O$$


where $W^O \in \mathbb{R}^{d \times d_{model}}$
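

The four steps can be put together in a short sketch. It is written in plain Python for readability, with random matrices standing in for trained weights. Note one assumption: each head here applies its own projection matrices to the full word vector (as in the original paper) rather than slicing the vector, and all names (`attention`, `multi_head`, `heads`, `W_O`) are illustrative.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    # Step 2: one attention output Z_i per head, each of shape T x (d/h).
    Z_heads = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Step 3: concatenate along the feature axis to get Z' of shape T x d.
    Z_cat = [sum((Z[t] for Z in Z_heads), []) for t in range(len(X))]
    # Step 4: mix the heads with the output projection W^O.
    return matmul(Z_cat, W_O)

rng = random.Random(0)
T, d, h = 3, 8, 2                   # 3 words, model width 8, 2 heads (Step 1)
X = random_matrix(T, d, rng)        # stand-in for embeddings + positions
heads = [tuple(random_matrix(d, d // h, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(d, d, rng)
Z = multi_head(X, heads, W_O)       # one d-dimensional vector per word
```

The output `Z` has the same shape as the input `X` (T rows of d values), which is what lets these blocks be stacked.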
Conclusion


Multihead attention is a cornerstone of the transformer architecture, 
providing a significant boost to the model's capability to process and 
understand complex data. By utilizing multiple attention heads, this 
mechanism allows the model to attend to different parts of the input 
sequence from various perspectives, capturing diverse features and 
relationships. Multi-head attention also allows the attention operations of
all heads to run in parallel, which makes training faster and more scalable.
One important point to mention here is that the computational cost of
multi-head attention is about the same as that of single-head attention over
the full dimension, because each head works on a smaller subspace, yet it can
extract finer details from the different subspaces.
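

A quick parameter count illustrates why the cost stays about the same. This is a rough sketch that ignores biases and assumes each head has width d_model / h; the values d_model = 512 and h = 8 are those of the original paper, and the function name is made up for this example.

```python
def attention_parameter_count(d_model, h):
    d_head = d_model // h
    per_head = 3 * d_model * d_head          # W_Q, W_K, W_V of one head
    output_projection = d_model * d_model    # W^O
    return h * per_head + output_projection

single = attention_parameter_count(512, 1)   # one head of width 512
multi = attention_parameter_count(512, 8)    # eight heads of width 64
```

Both counts come out equal: splitting the width across heads rearranges the weights rather than adding new ones.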
Transformer architecture


Overview 


The Transformer architecture was introduced in 2017 in a paper titled
"Attention Is All You Need" by researchers from Google, including Ashish
Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin. This architecture brought
significant improvements in efficiency and performance for natural language
processing tasks such as machine translation.


Key Innovations 

There are multiple innovations that made Transformers so successful, such
as:

• Attention mechanism

• Self-attention mechanism

• Positional encoding

• Multi-head attention

Throughout this chapter, whenever needed, we will refer to this example of
machine translation.


Input: me encanta el cricket -> Transformer -> Output: I love cricket


Architecture
The Transformer consists of one encoder stack and one decoder stack.
Consider a machine translation example, where the Transformer translates
“me encanta el cricket” in Spanish to “I love cricket” in English.


[Figure: the encoder stack receives the input “me encanta el cricket” and the decoder stack produces the output “I love cricket”]


The encoder stack is made of multiple encoder blocks, and similarly the
decoder stack has multiple decoder blocks. In the original Transformer paper
the researchers stacked six of each on top of each other. So we have six
encoders and six decoders.


[Figure: six stacked encoders and six stacked decoders, with input “me encanta el cricket” and output “I love cricket”]


Inputs


For a particular time step t_i, the input to the bottom encoder is the word
embedding with positional encoding.


[Figure: the input words “me encanta ...” are turned into embeddings, and positional encoding is added to give the embedding with positional encoding]


The input for the bottom decoder, in contrast, is the decoder output from
the previous time step (again as word embedding with positional encoding).
The decoder output starts with a <start> token and ends with an <end> token
that marks the end of the output. At each time step t_i we feed in the
decoder output from the previous time step t_{i-1}.
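

The "shifted right" input can be illustrated with a tiny sketch. The helper name `decoder_inputs` is hypothetical; the token names follow the text above.

```python
def decoder_inputs(target_words, start="<start>"):
    # Shift right: prepend <start> and drop the last target word, so the
    # decoder input at step i is the target word of step i - 1.
    return [start] + target_words[:-1]

target = ["I", "love", "cricket", "<end>"]
inputs = decoder_inputs(target)
# Each pair is (what the decoder is fed, what it should produce next).
pairs = list(zip(inputs, target))
```

So the decoder is fed `<start>` and should produce “I”, is fed “I” and should produce “love”, and so on until `<end>`.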


[Figure: the decoder input (shifted right) “<start> I love” is turned into embeddings, and positional encoding is added to give the embedding with positional encoding]


Encoder block


The encoder stack has six encoder blocks in the Transformer. Each encoder
block consists mainly of one self-attention layer and one feed-forward layer.


[Figure: encoder block in which the word vectors w1, w2, w3 pass through Multi-Headed Self-Attention and then the Feed-Forward Neural Network]


The multi-headed self-attention layer relies mostly on matrix
multiplications: its output is a weighted sum of the value vectors. So to
introduce more non-linearity into our architecture, we add a feed-forward
neural network. The same feed-forward layer is applied independently to each
word position. The output of each encoder is passed up to the encoder block
above it.
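

To make "applied independently to each word position" concrete, here is a minimal sketch. It assumes the two-layer form with a ReLU in between used in the original paper; the weights are small made-up numbers, not trained values.

```python
def linear(x, W, b):
    # x: input vector, W: weight matrix with len(x) rows, b: bias vector.
    return [sum(xi * w for xi, w in zip(x, col)) + bi
            for col, bi in zip(zip(*W), b)]

def relu(v):
    # The non-linearity: negative values are clipped to zero.
    return [max(0.0, x) for x in v]

def feed_forward(x, W1, b1, W2, b2):
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hypothetical weights: model width 2, hidden width 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]   # three word positions
outputs = [feed_forward(x, W1, b1, W2, b2) for x in sequence]
```

Positions 1 and 3 hold the same vector and therefore get the same output: the layer never looks at neighbouring positions.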


Decoder Block 


The decoder stack has six decoder blocks in the Transformer. The bottom
decoder is fed the output of the decoder stack from the previous time step.
Each decoder has three main steps.


Masked Multi-Headed Self Attention 
The self-attention in the decoder is masked self-attention. When the decoder
generates the output words one after another, it can only use self-attention
over the previously generated words. Future words are not generated yet, so
we mask those future positions. This is done by setting the scores of future
positions to -inf before the softmax step in the self-attention calculation,
so that after softmax their attention weights become zero.
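

The masking step can be sketched as follows. The score values are arbitrary and the function name `masked_softmax` is illustrative; the only point is that a position may attend to itself and to earlier positions.

```python
import math

def masked_softmax(scores):
    # scores[i][j]: raw attention score of position i for position j.
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) are set to -inf before the softmax.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = masked_softmax(scores)
```

The first row becomes [1.0, 0.0, 0.0] and the second [0.5, 0.5, 0.0]: the large score 5.0 for a future word has no effect.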


[Figure: decoder block]


Encoder-Decoder Attention 
Encoder-Decoder attention layer is an attention layer that works just like 


multiheaded self-attention except it helps the decoder focus on relevant 
parts of the input sentence. 


The encoder processes the input sequence, and the output of the top encoder
is then transformed into a set of attention key and value vectors k and v
(or, combined into matrices, let’s call them K and V). These are used by each
decoder and help the decoder pay attention to the appropriate places in
the input sequence.


[Figure: the previous output, as word embedding with positional encoding, fed into the bottom decoder]


For the bottom decoder, the output of the previous time step provides the
query, and the K, V generated from the last encoder act as the key and value.
The other decoders create their query matrix from the layer below them, and
take the key and value matrices (K, V) from the output of the encoder stack.


Each decoder block in the decoder stack passes its decoding result up to the
block above it, just like the encoders did.


These steps repeat until we get a special symbol (let’s say <end>) as output
from the decoder stack, indicating that the transformer decoder has
completed its processing.

Feed-forward Neural Network 


The attention layers rely mostly on matrix multiplications: their output is
a weighted sum of the value vectors. So to introduce more non-linearity into
our architecture, we add a feed-forward neural network. The same feed-forward
layer is applied independently to each word position.


The Residual and Layer Normalisation 


One detail worth mentioning in the architecture of the encoder is that each
sub-layer (self-attention, feed-forward layer) in each encoder has a residual
connection around it, and is followed by a layer-normalization step. Why do
we need residual connections? Residual connections help deeper models to
overcome the vanishing and exploding gradient problems.
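

A minimal sketch of this "add and normalize" step is shown below. It leaves out the learned scale and shift that layer normalization normally carries, and the sub-layer is a made-up stand-in for self-attention or the feed-forward layer.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize one word vector to zero mean and (almost) unit variance.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer's input back to its output,
    # then apply layer normalization to the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, lambda v: [0.1 * a for a in v])   # toy sub-layer
```

Because the input x is added back unchanged, the gradient always has a direct path around the sub-layer, which is what helps deep stacks train.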


[Figure: encoder block with a residual connection around each sub-layer, above the inputs]


This goes for the sub-layers of the decoder as well.


[Figure: decoder block with a residual connection around each of its three sub-layers]


The Final Linear & Softmax Layer


The decoder block outputs a vector of floats. How do we get our output word
from that? We have two more layers to achieve that: one linear layer and one
softmax layer.


Linear Layer
The linear layer is a fully connected neural network that projects the
decoder block output into a much, much larger vector called a logits vector.


What should be the size of the logits vector?


It should be equal to the size of the output vocabulary.


Let's assume our model knows 10,000 unique English words; that is the model
vocabulary it has learned from its training dataset. So we need our logits
vector to have 10,000 dimensions, where each dimension corresponds to the
score of a unique word.


Softmax Layer 
The softmax layer then turns those scores into probabilities (all positive,
all adding up to 1.0). The dimension with the highest probability is chosen,
and the word associated with it is produced as the output for this time step.
In the example in the image, the eighth position has the highest probability,
and the word at the eighth position is “apple”. Hence our output for that
time step will be “apple”.
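

The two layers can be sketched on a toy scale. Everything below is made up for illustration: a decoder output of width 2, a 9-word vocabulary with “apple” in the eighth position, and hand-picked weights for the linear layer (bias omitted).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "car", "sky", "run", "blue", "tree", "apple", "book"]
decoder_output = [1.0, 2.0]
W = [[0.1, 0.2, 0.0, -0.3, 0.5, 0.1, 0.2, 1.0, 0.0],   # one row per input value,
     [0.0, 0.1, 0.3, 0.2, -0.1, 0.4, 0.0, 1.5, 0.2]]   # one column per word

# Linear layer: one score (logit) per vocabulary word.
logits = [sum(x * w for x, w in zip(decoder_output, col)) for col in zip(*W)]
# Softmax layer: scores become probabilities.
probs = softmax(logits)
# argmax: index of the dimension with the highest probability.
best_index = max(range(len(probs)), key=probs.__getitem__)
word = vocab[best_index]
```

With these numbers the largest logit sits at index 7 (the eighth position), so `word` is “apple”.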


[Figure: decoder block output -> logits -> index of the dimension with the highest probability (argmax) = 8 -> vocabulary word with this index = apple]


Final Architecture 


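If we think of a Transformer with 2 stacked encoders and decoders, it would
look something like this:


[Figure: final architecture with two stacked encoders and two stacked decoders; inputs and outputs (shifted right) enter at the bottom with positional encoding, and output probabilities come out at the top]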


Training and Loss Function 


Finally, let’s look at how we train the model. Like any other neural
network, it goes through a forward pass and a backward pass with
backpropagation. As we are training the model with labeled data, we can
compare the model output with the actual output.


Consider the language translation example we used here, “I love cricket”.
Our output vocabulary contains just these words. The output vocabulary is
created before we start the training, in the preprocessing phase. One
end-of-sentence marker (let’s call it <eos> here) is also added to the
vocabulary.


[Figure: the output vocabulary and the one-hot encoding of each of its words]


So in the first time step we want the output to be “I”, that is, a
probability distribution that puts almost all its weight on the word “I”.
But since this model is not yet trained, that’s unlikely to happen just yet.
The model’s parameters are initialised randomly, so the model will produce
some random output.


Now we need to compare the model output with the desired output. The
simplest idea is to subtract one from the other; in practice a loss function
such as cross-entropy is used. For more details, please explore the
cross-entropy loss.
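

As a small illustration of the cross-entropy loss, the sketch below compares a one-hot desired output for the word “I” with two model outputs. The probability values are invented for the example, and the vocabulary order is an assumption.

```python
import math

def cross_entropy(predicted, desired):
    # desired is one-hot, so only the probability that the model gives
    # to the correct word contributes to the loss.
    return -sum(d * math.log(p) for p, d in zip(predicted, desired) if d > 0)

vocab = ["cricket", "I", "love", "<eos>"]
desired = [0, 1, 0, 0]                    # one-hot encoding of "I"
untrained = [0.25, 0.25, 0.25, 0.25]      # random-looking, uninformed output
trained = [0.02, 0.93, 0.03, 0.02]        # most of the weight on "I"

loss_untrained = cross_entropy(untrained, desired)
loss_trained = cross_entropy(trained, desired)
```

The loss shrinks towards zero as the probability of the correct word approaches 1, which is the direction backpropagation pushes the parameters.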


[Figure: untrained model output]


Here we are considering a vocabulary of just three words plus <eos>, but in
reality the vocabulary will be very large (it may contain more than 50,000
words), so our output vector will also be that large. If we train the model
properly, our output will be close to our desired output. Here we are showing
the desired and predicted output for each time step or position.


[Figure: desired output and trained model output over the output vocabulary (cricket, I, love, <eos>) for time steps 1 to 4]


At each time step, the model has to pick an output word from these
probabilities. There are two common ways of doing this.


Greedy Decoding 
The model will produce the output word with the highest probability at each 
time step and throw away the rest of the probabilities. 


Decoding using Beam Search 

Consider at time step 1, the top two words are “I” and “cricket”. So we will 
run the model twice, 

- Once assuming the first output is “I”. 

- Another time assuming the first output is “cricket”. 


Whichever option produces less error, considering both time step 1 and time
step 2, will be kept. We do the same for time steps 2 and 3, and so on. Here
a beam search with beam size 2 is applied. The beam size is a hyperparameter
that we can experiment with.
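

The difference between the two strategies can be seen on a toy example. The table of next-word probabilities below is entirely hypothetical and stands in for the trained model; "less error" is taken here to mean the highest total log-probability, which is the usual way beam search scores candidates.

```python
import math

# Hypothetical next-word probabilities given the words produced so far.
TABLE = {
    (): {"cricket": 0.5, "I": 0.45, "love": 0.05},
    ("cricket",): {"<eos>": 0.4, "I": 0.3, "love": 0.3},
    ("I",): {"love": 0.9, "cricket": 0.05, "<eos>": 0.05},
    ("I", "love"): {"cricket": 0.9, "<eos>": 0.1},
}

def next_word_probs(prefix):
    # Any prefix missing from the table simply ends the sentence.
    return TABLE.get(tuple(prefix), {"<eos>": 1.0})

def greedy_decode(max_len=5):
    words = []
    while len(words) < max_len and (not words or words[-1] != "<eos>"):
        probs = next_word_probs(words)
        words.append(max(probs, key=probs.get))   # keep only the top word
    return words

def beam_search(beam_size=2, max_len=5):
    beams = [(0.0, [])]   # (total log-probability, words so far)
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words and words[-1] == "<eos>":
                candidates.append((score, words))   # finished, carry over
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((score + math.log(p), words + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(words[-1] == "<eos>" for _, words in beams):
            break
    return beams[0][1]

greedy_result = greedy_decode()
beam_result = beam_search(beam_size=2)
```

With this table, greedy decoding commits to “cricket” at the first step and stops early, while beam search keeps “I” alive as the second candidate and ends up with “I love cricket”.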


Conclusion 


As we conclude our exploration of the fundamentals of transformer 
architecture, it’s clear that we are standing at the forefront of a 
revolutionary era in artificial intelligence and NLP. The journey from 
understanding basic concepts to delving into the intricacies of the whole 
architecture has been both enlightening and empowering. 


Transformers have fundamentally changed how we process sequential data, 
allowing for unprecedented accuracy and efficiency. Their ability to handle 
long-range dependencies and capture complex patterns has opened up new 
possibilities in a wide range of applications, from machine translation and 
text generation to image processing and beyond. 


To further enhance your understanding and provide more value, we have 
included a few extra topics at the end of this book. 
Extras


Vector Similarity 


Dot Product 


The dot product can be used as a measure of similarity between two word
vectors.


Consider the words banana, orange and mobile and their embedding/feature
vectors as shown below. Let’s just consider two human-readable features
(electronics, fruit) for simplicity.


Banana and orange have similar feature values, as both are fruit, whereas
the feature values of mobile are totally different.


            electronics   fruit
banana      1             8
orange      2             6
mobile      8             1


banana * orange = 1*2 + 8*6 = 50
banana * mobile = 1*8 + 8*1 = 16
orange * mobile = 2*8 + 6*1 = 22


Banana and orange both have high fruit values (which give a large value when
multiplied) and low electronics values (which give a small value when
multiplied). Hence, their dot product is high.


Mobile paired with banana or orange, on the other hand, always multiplies a
high value with a low one (high electronics with low electronics, low fruit
with high fruit). Hence, their dot product is low.
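

These dot products can be checked with a few lines of Python. The dictionary `features` and the helper `dot` are illustrative names; each vector is written as [electronics, fruit].

```python
# Feature vectors as [electronics, fruit].
features = {"banana": [1, 8], "orange": [2, 6], "mobile": [8, 1]}

def dot(u, v):
    # Multiply matching features and add up the results.
    return sum(a * b for a, b in zip(u, v))

banana_orange = dot(features["banana"], features["orange"])   # 1*2 + 8*6
banana_mobile = dot(features["banana"], features["mobile"])   # 1*8 + 8*1
orange_mobile = dot(features["orange"], features["mobile"])   # 2*8 + 6*1
```

The two fruits score 50 against each other, while either fruit paired with mobile scores 16 or 22.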


Cosine Similarity 


Cosine similarity calculates the cosine of the angle between the word
feature vectors. The larger the angle, the lower the similarity. We have
plotted banana, orange and mobile in the 2D vector space here; let’s see
their cosine similarity now.


[Figure: banana, orange and mobile plotted as vectors in a 2D space with the axes electronics and fruit]


The angle between banana and orange is a°, between orange and mobile it is
b°, and between banana and mobile it is c°.


We can see here that a < b < c, hence cos(a) > cos(b) > cos(c).


So it indicates similarity between banana and orange is highest, which is 
true. 
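

The same ordering can be confirmed numerically. This is a small sketch using the feature vectors from the dot-product example, written as [electronics, fruit]; the helper name `cosine` is illustrative.

```python
import math

features = {"banana": [1, 8], "orange": [2, 6], "mobile": [8, 1]}

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cos_a = cosine(features["banana"], features["orange"])   # angle a
cos_b = cosine(features["orange"], features["mobile"])   # angle b
cos_c = cosine(features["banana"], features["mobile"])   # angle c
```

The values come out in the order cos(a) > cos(b) > cos(c), with cos(a) close to 1 for the two fruits.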


Scaled Dot Product 


Another measure of similarity is the scaled dot product, where the dot
product is divided by the square root of the dimension of the vectors
(here 2).


banana and orange -> (1*2 + 8*6) / √2 = 50 / √2
orange and mobile -> (2*8 + 6*1) / √2 = 22 / √2
banana and mobile -> (1*8 + 8*1) / √2 = 16 / √2


To account for varying lengths of representation, the scaled dot product
normalizes the dot product by the square root of the representation vector’s
length (its number of dimensions).
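

A minimal sketch of the scaled dot product, again with the [electronics, fruit] vectors from above; the function name is illustrative.

```python
import math

features = {"banana": [1, 8], "orange": [2, 6], "mobile": [8, 1]}

def scaled_dot(u, v):
    # Dot product divided by the square root of the number of dimensions.
    return sum(a * b for a, b in zip(u, v)) / math.sqrt(len(u))

banana_orange = scaled_dot(features["banana"], features["orange"])   # 50 / sqrt(2)
orange_mobile = scaled_dot(features["orange"], features["mobile"])   # 22 / sqrt(2)
banana_mobile = scaled_dot(features["banana"], features["mobile"])   # 16 / sqrt(2)
```

Scaling does not change the ranking of the pairs; it keeps the scores in a similar range when the vectors have many more dimensions, as in the attention mechanism.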
Periodic Functions: Trigonometry Recap


Sine & Cosine
The general equation for a sine or cosine function looks like below:


y = a sin(b(x + c)) + d
y = a cos(b(x + c)) + d


where a = amplitude (which decides vertical stretch)
b = frequency (which decides horizontal stretch)
2π/b = period
c = phase shift
d = vertical shift


Amplitude
The amplitude of a sine or cosine curve is its height, which defines the
vertical stretch. It represents half the distance between the maximum and
minimum values of the function.


Period
Horizontal stretch is measured for sine and cosine functions by their
periods. This is why these functions are also called the periodic function
family. The period of a sinusoid is the length of a complete cycle, measured
from peak to peak. For the basic sine and cosine functions, the period is 2π.


Frequency 
Frequency is a different way of measuring horizontal stretch. With sine or
cosine functions, frequency is the number of cycles that occur in 2π. A
shorter period means more cycles can fit in 2π and thus a higher frequency.
Period and frequency are inversely related by the following equations:


Period = 2π / Frequency


Frequency = 2π / Period


[Figure: a sine curve with its period and amplitude marked]


Phase Shift
Phase shift is how far the function is shifted horizontally from the original
position.


Vertical Shift
Vertical shift is how far the function is shifted vertically from the usual
position. A positive value shifts the curve upward.
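

The roles of a, b, c and d can be tried out with a short sketch. The function names and the sample values are only for illustration.

```python
import math

def sinusoid(x, a=1.0, b=1.0, c=0.0, d=0.0):
    # y = a sin(b(x + c)) + d
    return a * math.sin(b * (x + c)) + d

def period(b):
    # Period and frequency are inversely related: period = 2π / frequency.
    return 2 * math.pi / b

a, b, c, d = 2.0, 2.0, 0.5, 1.0
x = 0.3
y_now = sinusoid(x, a, b, c, d)
y_one_period_later = sinusoid(x + period(b), a, b, c, d)

# Sample one full cycle to see the range of the curve.
samples = [sinusoid(k * period(b) / 1000, a, b, c, d) for k in range(1000)]
highest, lowest = max(samples), min(samples)
```

With b = 2 the period is π, so the curve repeats after π. The curve swings between about d - a = -1 and d + a = 3, so half the distance between maximum and minimum is the amplitude a.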
CLOSING THOUGHTS


Our goal throughout this book has been to make these advanced concepts 
accessible, breaking them down into comprehensible parts and illustrating 
their practical applications. By demystifying the theory behind 
transformers, we hope to have provided you with a robust understanding 
and the confidence to apply these techniques in your own work. 


As you move forward, remember that the field of artificial intelligence is 
continuously evolving. The knowledge and skills you have acquired will 
serve as a strong foundation for further learning and innovation. Stay 
curious, keep experimenting, and don’t hesitate to push the boundaries of 
what is possible with transformer models. 


Thank you for joining me on this journey. I hope this book has not only 
expanded your knowledge but also inspired you to explore the limitless 
potential of transformer architectures. The future of AI is bright, and with
your newfound understanding, you are well-equipped to be a part of its 
exciting evolution. 


Here's to the future of artificial intelligence, where the foundations laid by 
transformer models will continue to drive innovation and discovery. Happy 
exploring and creating! 

I would love to hear your thoughts on this book. Your feedback and reviews 
are greatly appreciated and will help improve future editions. Please take a 
moment to share your experience and suggestions. Thank you for your 
support and happy learning! 


Best regards, 


Debstuti Das 